In this article, I'd like to share a real-world example of how Datahunt used OCR technology for data processing, and how that worked out on a project for KEPCO.
What is OCR?
Optical Character Recognition (OCR) is a technology that recognizes and extracts text from photos, scanned documents, videos, and more. In effect, it works as an optical character reader, converting characters visible to the human eye into a data format that a computer can read.
OCR is widely used across industries and applications, and plays an important role in document management, automated data entry, print digitization, translation, text mining, robotic process automation, and more. For example, it can extract the content of scanned documents and store it as digital information, or it can enhance security by identifying sensitive personal information on documents such as licenses and social security cards.
The OCR process
The main steps of OCR technology are as follows:
- Preprocessing: Improves the quality of the image and removes noise to increase the character recognition rate. This step includes binarizing the image, blurring, removing background, etc.
- Character segmentation: The process of finding and separating character regions in an image. This step includes detecting text regions, segmenting lines, and segmenting characters.
- Character recognition: Recognizes the actual characters in the segmented text regions. This step uses an image classification model.
- Post-processing: Refines the recognized characters and derives the correct result based on context. This step uses word checking, grammar checking, and language models to correct errors and generate the final output.
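To make the steps above more concrete, here is a minimal toy sketch in Python of three of them: binarization (preprocessing), splitting on empty columns (character segmentation), and dictionary-based correction (post-processing). The function names, the fixed threshold of 128, and the tiny sample image are assumptions made for illustration, not Datahunt's actual pipeline, and the recognition step is left out because it requires a trained model.

```python
from difflib import get_close_matches

def binarize(gray, threshold=128):
    """Preprocessing: map each grayscale pixel (0-255) to 1 (ink) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def segment_columns(binary):
    """Character segmentation: split wherever a column contains no ink.
    Returns (start, end) column ranges with end exclusive."""
    width = len(binary[0]) if binary else 0
    ink = [any(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                    # a character region begins
        elif not has_ink and start is not None:
            spans.append((start, x))     # the region ends at an empty column
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

def correct_word(word, vocabulary):
    """Post-processing: snap a recognized word to its closest vocabulary entry, if any."""
    match = get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return match[0] if match else word

# Hypothetical 2x7 grayscale image: dark pixels are ink.
gray = [
    [255, 20, 255, 255, 30, 40, 255],
    [255, 10, 255, 255, 25, 255, 255],
]
binary = binarize(gray)
spans = segment_columns(binary)
print(spans)                                            # [(1, 2), (4, 6)]
print(correct_word("invoce", ["invoice", "receipt"]))   # invoice
```

Real systems typically replace the fixed threshold with an adaptive one and the column split with a learned text detector, but the division of labor between the steps is the same.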
At Datahunt, we are conducting R&D on all of the above processes to create better OCR models.
OCR Processing in Action - KEPCO
Last year, Datahunt worked with KEPCO on an OCR project, which was expected to be very labor-intensive as tens of thousands of images needed to be labeled.
The task was to mark the region of every piece of text in each image and to assign attributes to everything that fell into a given category. This was difficult work that took at least five minutes per image, even for one of Datahunt's experienced operators. At 100,000 images, that adds up to more than 8,000 hours in total (100,000 × 5 minutes ≈ 8,333 hours), so this was a task where AI could really shine.
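The estimate above can be reproduced with a few lines of Python. The 8-hour working day used to convert hours into person-days is an assumption added for illustration; the image count and the minutes per image come from the text.

```python
def manual_hours(num_images, minutes_per_image):
    """Total manual labeling time in hours."""
    return num_images * minutes_per_image / 60

hours = manual_hours(100_000, 5)
person_days = hours / 8  # assumption: 8 working hours per day

print(f"{hours:,.0f} hours")              # 8,333 hours
print(f"{person_days:,.0f} person-days")  # 1,042 person-days
```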
How we improved OCR processing quality
At DataHunt, we took several steps to improve our OCR model:
- Detect whether the text in the image is skewed overall.
- Apply an ensemble technique that combines multiple models.
- Add a parameter that adjusts the bounding-box level (syllable, word, sentence) based on the data.
This led to faster and more accurate work.
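As a rough illustration of the second and third steps, the sketch below shows one simple way to ensemble several models (a majority vote over their transcriptions) and one simple way to parameterize the bounding-box level (merging neighboring boxes whose horizontal gap is at most `max_gap`). Both functions, the `max_gap` parameter, and the sample data are hypothetical; the article does not say which ensemble method or which parameter DataHunt actually used.

```python
from collections import Counter

def vote_transcription(candidates):
    """Ensemble by majority vote: return the most common transcription.
    Ties resolve to the candidate that appears first."""
    return Counter(candidates).most_common(1)[0][0]

def merge_boxes(boxes, max_gap):
    """Merge boxes on a single text line whose horizontal gap is <= max_gap.
    Each box is (x_min, y_min, x_max, y_max). A small max_gap keeps
    syllable-level boxes; larger values yield word- or sentence-level boxes."""
    merged = []
    for box in sorted(boxes):
        if merged and box[0] - merged[-1][2] <= max_gap:
            last = merged[-1]
            merged[-1] = (last[0], min(last[1], box[1]),
                          max(last[2], box[2]), max(last[3], box[3]))
        else:
            merged.append(tuple(box))
    return merged

# Three hypothetical models disagree on one text region.
print(vote_transcription(["전력", "전럭", "전력"]))  # 전력

# Four syllable-level boxes forming two words on one line.
syllables = [(0, 0, 10, 10), (12, 0, 22, 10), (40, 0, 50, 10), (52, 0, 62, 10)]
print(merge_boxes(syllables, max_gap=5))   # [(0, 0, 22, 10), (40, 0, 62, 10)]
print(merge_boxes(syllables, max_gap=20))  # [(0, 0, 62, 10)]
```

Exposing the granularity as a single parameter means the same detector output can be reused for datasets that need labels at different levels.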
The importance of data preprocessing for OCR
While the model's own performance matters, preprocessing was more important for OCR than for other tasks. The reason is that we draw either a bounding box or a polygon around each piece of text, and in general a polygon is harder to work with than a bounding box. Perhaps because polygons are also harder for humans to work with, slanted text yields lower polygon accuracy and lower transcription accuracy than upright text.
- Polygons are harder to work with (create/modify) than bounding boxes.
- AI models have difficulty with skewed text.
Therefore, if you can straighten the text before asking the AI to do the work, it will be able to create more accurate labels.
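The geometry behind this can be sketched in a few lines of Python. The example below works on coordinates only: it reads the skew angle from one edge of a text quadrilateral, rotates the points back, and compares the axis-aligned box before and after. The function names and the convention that the first edge follows the reading direction are assumptions for illustration; in practice the image pixels themselves would be rotated with an image library.

```python
import math

def rotate(points, angle_deg, center):
    """Rotate (x, y) points by angle_deg around center."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    cx, cy = center
    return [
        (cx + (x - cx) * cos_a - (y - cy) * sin_a,
         cy + (x - cx) * sin_a + (y - cy) * cos_a)
        for x, y in points
    ]

def bounding_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def skew_angle(quad):
    """Skew in degrees, read from the first edge of the quadrilateral
    (assumed to run along the reading direction, left to right)."""
    (x0, y0), (x1, y1) = quad[0], quad[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))

def deskew(quad):
    """Rotate the quadrilateral upright around its centroid and box it."""
    angle = skew_angle(quad)
    cx = sum(x for x, _ in quad) / len(quad)
    cy = sum(y for _, y in quad) / len(quad)
    return angle, bounding_box(rotate(quad, -angle, (cx, cy)))

# A 4x2 text region tilted by 30 degrees.
upright = [(0, 0), (4, 0), (4, 2), (0, 2)]
tilted = rotate(upright, 30, (0, 0))

loose = bounding_box(tilted)   # box drawn directly on the skewed text
angle, tight = deskew(tilted)  # box drawn after straightening

print(round(angle, 1))                                        # 30.0
print(round(loose[2] - loose[0], 2), round(loose[3] - loose[1], 2))  # 4.46 3.73
print(round(tight[2] - tight[0], 2), round(tight[3] - tight[1], 2))  # 4.0 2.0
```

The box drawn on the skewed region covers roughly twice the area of the text itself, which is why skewed text pushes labelers toward polygons, while the straightened region fits a plain bounding box.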
While this can definitely improve your results, you may still have questions about the process, which is not fully automatic. To answer in advance: there are several improvements beyond preprocessing, and I have only outlined them here.
OCR improvement results
To see how much the above process helped our OCR work, we measured efficiency by directly comparing the time taken with and without AI assistance. For reference, the OCR work was performed on Korean text.
The results were surprising: every worker saw a significant reduction in work time. While there was some variation, the average reduction was around 40%. In other words, with a little extra effort, a worker can process roughly five images in the time that three used to take.
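Taking the 40% average at face value, the throughput gain follows directly from the time reduction. The short Python check below (the function name is ours) shows that it works out to roughly 1.67 times the baseline, a little short of a full doubling.

```python
def throughput_gain(time_reduction):
    """Images processed per unit time, relative to the baseline,
    for a given fractional reduction in time per image."""
    if not 0 <= time_reduction < 1:
        raise ValueError("time_reduction must be in [0, 1)")
    return 1 / (1 - time_reduction)

gain = throughput_gain(0.40)
print(f"{gain:.2f}x")                   # 1.67x
print(f"{throughput_gain(0.50):.2f}x")  # 2.00x, so a true doubling needs a 50% reduction
```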
This conclusion matters not only for cost and time but also for accuracy. Even with plenty of breaks, workers who handle a large volume of images tire quickly and make more mistakes. Making the work easier, as described above, can therefore be expected to make it more accurate as well.
In this article, we've explained what OCR is, how the process works, and how it played out in a real-world project, and we've shown, down to the quantitative level, how we use AI to streamline our work at DataHunt.