What is Optical Character Recognition (OCR)?
Optical character recognition (OCR) refers to the process of converting an image of text into machine-readable text. For example, if you scan a physical printout or receipt, your computer saves the scan as an image file. With an image file, you can't use a text editor to edit the words, search them, or count them. With OCR, however, you can convert the image to a text document, store it as text data, and do a variety of things with it.
OCR = Text detection + Text recognition
The OCR engine has several OCR models and algorithms that perform step-by-step tasks. Text detection and text recognition are the most central tasks, and the overall workflow is shown below.
- Pre-processing: OCR input images are often corrupted by noise, skewed, or even rotated. In such cases the later steps will not produce good results, so the noise is removed and the rotation or skew is corrected beforehand.
- Text Detection: Text detection can be thought of as an extension of object detection: it finds the areas of the image that contain text. Text recognition refers to the process of recognizing what each character in a detected area is. OCR is therefore a combination of classification, detection, and segmentation techniques. More recently, end-to-end models that perform both tasks at the same time have been developed, whether to improve speed, simplify the pipeline, or for other reasons.
- Text Recognition: In this step, the text inside the regions found by detection and segmentation is recognized. The result is typically a text file that lists the recognized text together with the coordinates of each text area within the image.
- Restructuring: Restructuring means repositioning the text according to the coordinates it was in the input image. All you need to do is iterate over each coordinate to place the recognized text. The restructured data will be structurally similar to the original image and will help you recognize the context of the information in the text.
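The restructuring step can be sketched in a few lines of Python. This is a minimal illustration, not a production layout engine: the `TextBox` type, the `restructure` function, and the `line_tolerance` parameter are names assumed here for the example, and real documents with columns or tables need more careful grouping.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str   # recognized text
    x: float    # left edge in pixels
    y: float    # top edge in pixels
    h: float    # box height in pixels


def restructure(boxes, line_tolerance=0.5):
    """Group boxes into visual lines by vertical position, then order each line left to right."""
    lines = []  # each entry is the list of boxes on one visual line
    for box in sorted(boxes, key=lambda b: b.y):
        # Stay on the current line when the box top is close to the line's first box.
        if lines and abs(box.y - lines[-1][0].y) <= line_tolerance * lines[-1][0].h:
            lines[-1].append(box)
        else:
            lines.append([box])
    return "\n".join(
        " ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines
    )


# Recognition output usually arrives in no particular order.
boxes = [
    TextBox("12.50", x=200, y=52, h=20),
    TextBox("Total", x=10, y=50, h=20),
    TextBox("Receipt", x=10, y=10, h=20),
]
layout = restructure(boxes)
print(layout)
```

Grouping by the top edge with a tolerance proportional to the box height is one simple choice; it keeps "Total" and "12.50" on the same line even though their coordinates differ by a couple of pixels.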
Why OCR is here to stay
OCR has been around for over 20 years now. Technological advancements have made it possible to apply OCR not only to scanned documents but also to everyday photos and video, and character recognition in video keeps getting better. Why is OCR technology important?
Make your business more efficient
OCR maximizes process efficiency and reduces costs. Processing documents, especially supporting documents (proofs), is routine work in many organizations, and as mobile usage becomes more popular, the scanned images that businesses process are being replaced by images captured on mobile devices. Instead of manually entering the text of the collected images into an Excel file or database for internal documentation, applying OCR technology can greatly increase productivity.
In particular, OCR can help streamline processes on the automation side, namely robotic process automation (RPA). RPA is a work automation technology that designs simple, repetitive work processes to be performed by software robots instead of humans. When paired with RPA, OCR can go beyond storing and sending recognized values and automate the entire process.
Fine Tuning LLM Training Data
OCR technology can also extract text from scanned documents and images, and that text can be used to train and fine-tune large language models (LLMs). LLMs are generative AI models that can generate text, translate languages, write different kinds of creative content, and answer questions. Examples include ChatGPT, Bard, PaLM, and other foundation models.
These LLMs, or large-scale language models, train on massive datasets of text and code, and the quality of the data they train on is critical to their performance. OCR technology can provide LLMs with high-quality data in a variety of formats, including scanned documents, images, and PDFs. This data can be used to train LLMs on a variety of tasks, including text classification, NLP, and Q&A.
OCR solutions and use cases
PDF OCR is the automatic application of optical character recognition (OCR) to documents, converting them into editable PDF copies. The benefit of OCR in a PDF scanner is that it converts images into text data that other business software can read, saving individuals and businesses time and money.
Here's a quick list of free PDF scanners that you can easily use in your everyday life.
- Browser-activated, no software installation required
- Desktop applications are available when offline use is required
- Compatible with Linux and smartphones
- Convert online in a web browser, work offline with a desktop application
- Send and share OCRed documents to others via email
- Convert recognized PDFs to HTML, DOCX, TIFF and edit them in a web browser
- Support for 100+ formats
- Choose your OCR language from English, French, German, Spanish, Italian, and Portuguese
- Convert PDF to PPT, Word, JPG, PDF/A, PNG, TIFF, Grayscale, Excel, TXT
- Max file size 128MB, up to 1 free file per hour, 10 files per processing
- High recognition quality using the Aspose API
- Correct low resolution, low contrast, noise, and skew by performing automatic image preprocessing
- Automated document layout detection algorithm
OCR system with ChatGPT
ChatGPT is a conversational AI service from OpenAI, built on a model trained on large amounts of text such as books, code, and web pages. Combined with OCR technology, it is well suited to post-processing raw text extracted from images. By combining GPT with an OCR system, documents scanned on computers and phones can be sent to the API automatically.
An OCR model built on an LLM trains on vast datasets to gain a deep understanding of the structure of language, which can improve accuracy on challenging inputs such as handwritten text or low-quality scans. Learning the underlying principles of a language also enables it to recognize text in new contexts. For example, you can provide GPT with OCR data and then ask it questions about the details of the text.
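As a rough illustration of asking a model questions about OCR output, the sketch below only builds the prompt and hands it to a callable. The names `build_qa_prompt`, `ask_about_document`, and `ask_llm` are hypothetical; `ask_llm` stands in for whichever LLM client you actually use, and no real API is called here.

```python
from typing import Callable


def build_qa_prompt(ocr_text: str, question: str) -> str:
    """Wrap raw OCR output and a question into a single prompt string."""
    return (
        "The following text was extracted from a scanned document by OCR "
        "and may contain recognition errors.\n"
        f"---\n{ocr_text.strip()}\n---\n"
        f"Question: {question.strip()}\n"
        "Answer using only the text above."
    )


def ask_about_document(ocr_text: str, question: str, ask_llm: Callable[[str], str]) -> str:
    # ask_llm is a hypothetical stand-in for a real LLM client call.
    return ask_llm(build_qa_prompt(ocr_text, question))


def fake_llm(prompt: str) -> str:
    # Stub model for demonstration only: it just reports the prompt length.
    return f"(stub) received {len(prompt)} characters"


prompt = build_qa_prompt("T0tal: 12.50 USD", "What is the total amount?")
answer = ask_about_document("T0tal: 12.50 USD", "What is the total amount?", fake_llm)
print(answer)
```

Telling the model that the text may contain recognition errors is a design choice: it invites the model to read "T0tal" as "Total" rather than treat it as an unknown word.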
Of course, LLM-based OCR models have both advantages and limitations. They can respond accurately to challenging input, such as handwritten text or low-quality images, and they can recognize text in a wider range of languages than traditional OCR models, which handle only a few languages well, including English. However, they can be expensive to train and deploy, and they need a large amount of training data. Make sure that the data you feed into the model resembles the training data; if it is very different, the model might not recognize it.
NAVER is focusing on improving the model according to the actual usage environment and practicality. The document understanding OCR service provided by Clova AI has separated the API according to the type of document. This means that the pre-processing and post-processing methods are set differently depending on the type of document, such as receipts, credit cards, business licenses, bills, business cards, and ID cards.
Naver's text detection model first prepares its training data with a watershed-based preprocessing method and then trains a U-Net-based network. For text recognition, the team combines different model components to recognize the detected text and proposes an objective benchmark for comparing them. According to the paper, the TPS+ResNet+BiLSTM+CTC combination is the best.
Kakao has been focusing on real-time performance, designing its model with characteristics similar to YOLO. Kakao's first recognition model used RNN layers, but the team researched ways to overcome its sequential dependency and the limitation that characters can only be decoded one at a time. As a result, they succeeded in maintaining performance while improving speed through a self-attention structure.
Kakao's OCR configuration keywords can be categorized as follows:
- Structure: Separate Text detection, Text Recognition models
- Text detection model: Character-level Output, Model-based Clustering, Orientation Prediction, Simple Postprocessing (No NMS)
- Text recognition model: Less Resources(No TPS), Transformer Only Model, Fixed Length Input
Google, Microsoft, Meta
Google has also built OCR technology using multiple heads, in a similar way to Kakao. It predicts bounding boxes for words and for characters separately and then performs recognition. In addition, Google is applying attention to vision tasks through the Vision Transformer and similar models, and is using it for OCR as well.
Microsoft Azure detects text regions paragraph by paragraph and returns results at the line and word level. This is somewhat different from Google's approach described earlier, and it has drawn attention because it can be applied to structured text. In other words, Microsoft is focusing on an OCR model that specializes in structured text.
Meta uses a multimodal structure that combines image-text and region-text-QA inputs. The results have been surprising: multimodal training on TextVQA data with this structure can outperform the same model without it by 20% or more in some cases.
Deep learning and OCR
Before Deep learning
Before deep learning, OCR commonly relied on Tesseract OCR as its engine. Tesseract was created by Hewlett-Packard, later open-sourced, then developed further by Google, and it is still being actively researched. If you're interested in learning more about the history of OCR, check out here.
Before deep learning was utilized for OCR, how did we find characters in images? If you're interested, here's the link to the paper.
The rough architecture of the OCR we'll be discussing in this article is shown below.
- Adaptive Thresholding: Binarize the input image, converting it to black and white
- Connected Component Analysis: Detect character regions
- Find Lines and Words: Extract lines or words
- Recognize Word: Recognize and recombine characters one by one to convert word-by-word images to text
These are the basic steps that also apply to typical deep learning-based OCR. The goal and basic flow are the same, but the methodology is different because it dates from before deep learning.
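The first two steps of this classical pipeline can be sketched in pure Python on a tiny grayscale image. This is an illustrative toy, not Tesseract's actual implementation: the local-mean rule, the `radius` and `offset` parameters, and the 4-connectivity choice are all assumptions made for the example.

```python
from collections import deque


def adaptive_threshold(img, radius=1, offset=20):
    """Mark a pixel as ink (1) when it is darker than its local mean by more than `offset`."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            if img[y][x] < sum(window) / len(window) - offset:
                out[y][x] = 1
    return out


def connected_components(binary):
    """Return one bounding box (x0, y0, x1, y1) per 4-connected ink region."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(x, y)])
                x0, y0, x1, y1 = x, y, x, y
                while queue:  # breadth-first flood fill over neighbouring ink pixels
                    cx, cy = queue.popleft()
                    x0, y0 = min(x0, cx), min(y0, cy)
                    x1, y1 = max(x1, cx), max(y1, cy)
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
                boxes.append((x0, y0, x1, y1))
    return boxes


# Toy 7x5 image: light background with two dark one-pixel-wide vertical strokes.
BG, INK = 200, 30
image = [[BG] * 7 for _ in range(5)]
for row in range(1, 4):
    image[row][1] = INK
    image[row][5] = INK

binary = adaptive_threshold(image)
regions = connected_components(binary)
print(regions)
```

Each returned box is a candidate character region; the later "Find Lines and Words" step would group such boxes by position before recognition.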
Text detection, which uses deep learning to find the location of text in an image, is similar to object detection and segmentation: it shares their general properties but differs in the details.
Since characters come together to form words or sentences, detecting characters in an image requires setting a minimum unit size;
[go] , [goo], [mah], [mah], [is], [flavor], [have], [have], [da].
For example, if we were to detect each letter in isolation, as shown above, we would have to put them together in context.
The figure above is from EAST: An Efficient and Accurate Scene Text Detector, published in 2017. At that time, the main focus was on finding the bounding box of the text, so there are various methods for finding horizontally oriented and multi-oriented text boxes. In particular, you can see that both word-level and character-level detection are used.
How does word-level detection differ from character-level detection?
Sentence or word-level detection
- Regression-based detection, as in object detection
- Must be able to find short words as well as long words or sentences
- Detect words by defining an anchor and estimating the presence of words and the size of the bounding box.
- Use bounding box regression
Character-level detection
- Finds letters without missing them, but they need to be put back together in context.
- Approaches the problem by segmenting the text area
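To make the anchor-based, word-level approach more concrete, here is a small sketch of how predicted offsets are commonly applied to an anchor box. It assumes the parameterization widely used in anchor-based object detectors (centre shifts scaled by anchor size, log-scale width and height); the function names and the example numbers are hypothetical and not taken from any specific OCR model.

```python
import math


def decode_box(anchor, deltas):
    """Apply predicted offsets (dx, dy, dw, dh) to an anchor given as (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    cx = ax + dx * aw          # shift the centre in units of the anchor size
    cy = ay + dy * ah
    w = aw * math.exp(dw)      # scale width and height multiplicatively
    h = ah * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def detect_words(anchors, scores, deltas, threshold=0.5):
    """Keep anchors whose text score passes the threshold and decode their boxes."""
    return [decode_box(a, d) for a, s, d in zip(anchors, scores, deltas) if s >= threshold]


# Two anchors; the network output (scores and deltas) is made up for illustration.
anchors = [(50.0, 20.0, 40.0, 10.0), (150.0, 20.0, 40.0, 10.0)]
scores = [0.9, 0.2]
deltas = [(0.25, 0.0, math.log(2.0), 0.0), (0.0, 0.0, 0.0, 0.0)]
words = detect_words(anchors, scores, deltas)
print(words)
```

Here the first anchor is kept, shifted right, and doubled in width to cover a longer word, while the second is discarded because its text score is low.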
An image of text carries its characters in reading order. However, the image may not be segmented into individual characters, because doing so would be costly or difficult.
Text recognition can be performed on such unsegmented data by combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Such a model is called a CRNN. CRNNs can estimate with high probability which characters the input image contains, and thus link image data processing with deep learning.
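CRNN-style recognizers are often paired with Connectionist Temporal Classification (CTC), which lets the network label a sequence without per-character segmentation. The sketch below shows greedy CTC decoding only; the per-frame probabilities are made up for illustration, and `ctc_greedy_decode` is a name chosen here, not a library function.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Pick the best class per frame, collapse consecutive repeats, then drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)


alphabet = ["-", "c", "a", "t"]  # index 0 is the CTC blank symbol
frame_probs = [
    [0.10, 0.80, 0.05, 0.05],  # c
    [0.10, 0.70, 0.10, 0.10],  # c again (repeat, collapsed)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.10, 0.70],  # t
    [0.10, 0.10, 0.10, 0.70],  # t again (repeat, collapsed)
]
decoded = ctc_greedy_decode(frame_probs, alphabet)
print(decoded)
```

The blank symbol is what allows genuinely doubled letters: two identical characters separated by a blank frame survive the collapse, while an unbroken run of the same character counts once.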
In addition, combining attention with text recognition models is having a major impact on deep learning OCR. Attention-based sequential prediction fixes the maximum length of a sentence and predicts the label of each character based on attention over the incoming features. It estimates the first label from the attention over the input features, then feeds the estimated label back in to estimate the next character.
For example, suppose you want to predict the word "Hello". Attention-based sequential prediction first estimates the label "H" based on the attention over the input features. It then uses the estimated label "H" as input again to estimate the next character, "e". Repeating this process predicts the whole word "Hello".
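The feedback loop described here can be sketched with a toy decoder that spells out "Hello". Everything in this example is an assumption made for illustration: `toy_step` is a hypothetical stand-in for a trained attention decoder, and its attention simply peaks at the current step instead of being computed from the previous label and a hidden state.

```python
import math

START, END = "<s>", "</s>"


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_step(prev_label, step, features, alphabet):
    """Hypothetical decoder step: attend over the features and return the best label.

    A trained model would derive the attention scores from prev_label and a hidden
    state; this toy ignores prev_label and peaks the attention at position `step`.
    """
    weights = softmax([-4.0 * (i - step) ** 2 for i in range(len(features))])
    # Context vector: attention-weighted sum of the feature vectors.
    context = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(len(alphabet))]
    return alphabet[max(range(len(alphabet)), key=context.__getitem__)]


def attention_decode(features, alphabet, max_len=10):
    labels, prev = [], START
    for step in range(max_len):  # fixed maximum output length
        prev = toy_step(prev, step, features, alphabet)  # feed the last label back in
        if prev == END:
            break
        labels.append(prev)
    return "".join(labels)


alphabet = ["H", "e", "l", "o", END]
# One one-hot feature vector per image column, standing in for CNN features.
features = [[1.0 if c == a else 0.0 for a in alphabet] for c in ["H", "e", "l", "l", "o", END]]
word = attention_decode(features, alphabet)
print(word)
```

Unlike CTC, this style of decoder emits exactly one label per step and stops on an end token, which is why a maximum length has to be fixed in advance.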
You can learn more about the performance of the combination of attention and text recognition models in Naver Clova's paper.
Latest OCR research papers
Below are technical briefs and papers that have recently gained traction, with more details available at each paper's link.
- Implements a segmentation-based approach that locates individual characters and links them together.
- Learns character regions by generating and placing circular score maps instead of labeling each region explicitly.
- Uses weakly supervised learning
- Detects text regions with a region proposal network, using the structure of Mask R-CNN
- Performs regression and classification in the box head for more accurate boxes, then instance segmentation in the mask head.
- Uses soft segmentation to account for inaccurate mask information
How to improve OCR data quality
Recently developed OCR systems have seen further performance improvements with the addition of error correction models. For example, if an OCR system mistakenly recognizes a character written as B as A, an error correction model can identify this error by comparing the probability of the character being A to the probability of it being B in the document. In some cases, employing an Error Correction model can reduce the error rate of an OCR system by up to 50%.
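One simple way to picture such an error correction model is to generate alternative spellings from characters that OCR commonly confuses and keep the spelling that is most frequent in reference text. The confusion sets, the reference words, and the function names below are assumptions for illustration; real correction models are usually learned from data.

```python
from collections import Counter
from itertools import product

# Assumed confusion sets: characters an OCR engine commonly mixes up.
CONFUSABLE = {
    "0": "0O", "O": "O0",
    "1": "1lI", "l": "l1I", "I": "I1l",
    "5": "5S", "S": "S5",
}


def candidates(word):
    """All spellings reachable by swapping each character within its confusion set."""
    options = [CONFUSABLE.get(ch, ch) for ch in word]
    return {"".join(chars) for chars in product(*options)}


def correct(word, word_counts):
    """Pick the candidate seen most often in the reference text.

    Ties prefer the original spelling, and the word is left unchanged
    when no candidate appears in the reference text at all.
    """
    best = max(candidates(word), key=lambda w: (word_counts[w], w == word))
    return best if word_counts[best] > 0 else word


# Word frequencies from text that is known to be correct (made up here).
reference = "TOTAL SALES TOTAL TAX 100 SALES".split()
counts = Counter(reference)

fixed = [correct(w, counts) for w in ["T0TAL", "5ALES", "100", "XYZ"]]
print(fixed)
```

Comparing candidate frequencies plays the role of comparing the probability of one reading against another; the number of candidates grows quickly with word length, so this brute-force version only suits short tokens.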
Errors can also be pinpointed when the recognized text has context that can be used, or when a specific error case is already known. Statistical data on error cases helps even more. You can take a manual approach, a word-level approach, a contextual approach, and so on. There are many other ways to improve OCR data quality, and here are two of DataHunt's favorite ways to improve OCR accuracy.
There are many ways to improve accuracy, and by analyzing the baseline performance of your OCR model, you can identify issues and improve performance. Here's a summary of how performance evaluation has typically been done in the past.
- Detection evaluation method (IoU, Intersection over Union): Check how much the correct answer and prediction boxes overlap
- Recognition Evaluation Method (WEM, Word based Exactly Matching): Word-based check if the correct answer and predicted words match exactly.
- Recognition evaluation method (1-NED, Normalized Edit Distance): Measures the edit distance between two words and normalizes it to the length of the longer word.
- End-to-End: Sequential evaluation processing from detection evaluation (IoU) → recognition evaluation (WEM, 1-NED), etc.
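The three measures above are short enough to write out directly. The sketch below assumes boxes in `(x0, y0, x1, y1)` form and uses the Levenshtein distance as the edit distance; the function names are chosen for this example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def one_minus_ned(truth, pred):
    """1 - edit distance normalized by the length of the longer word."""
    longest = max(len(truth), len(pred))
    return 1.0 if longest == 0 else 1.0 - edit_distance(truth, pred) / longest


def wem(truths, preds):
    """Share of word pairs that match exactly."""
    return sum(t == p for t, p in zip(truths, preds)) / len(truths)


box_score = iou((0, 0, 10, 10), (5, 0, 15, 10))
ned_score = one_minus_ned("receipt", "reciept")
wem_score = wem(["total", "12.50"], ["total", "12.5O"])
print(box_score, ned_score, wem_score)
```

WEM gives the misread "12.5O" no credit at all, while 1-NED still rewards the mostly correct characters, which is why the two are usually reported together.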
However, traditional methods cannot measure performance at a fine-grained level. If a single correct box is predicted as several split boxes, or if several correct boxes are predicted as one merged box, the exact performance cannot be measured. You can find more information on the limitations of traditional methods in this article.
The clue to solving the above problem was in PopEval. The new evaluation method was found to be superior to the old method by Pearson statistical analysis, especially for solving the split and merge problems.
The principle is simple: characters that match within overlapping areas are crossed out one by one, and performance is measured by the percentage of characters answered correctly. This closely resembles the way humans evaluate results, and because it works letter by letter, it allows for fine-grained performance measurement.
Compared to traditional methods, the PopEval evaluation algorithm comes closer to a human qualitative assessment. While it was conceived as a character-level approach, comparative experiments have shown that it is also compatible with existing datasets annotated at the word level, and it is gaining traction in the industry.
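The crossing-out idea can be illustrated with a simplified sketch. This is not the official PopEval implementation: it assumes that any non-zero overlap between a predicted box and a ground-truth box is enough to compare their characters, and the function names are chosen for this example.

```python
def overlaps(a, b):
    """True when boxes (x0, y0, x1, y1) share a non-zero area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def pop_eval(gt, pred):
    """gt and pred are lists of (box, text) pairs. Returns (recall, precision)."""
    remaining = [list(text) for _, text in gt]  # characters not yet crossed out
    popped = 0
    for p_box, p_text in pred:
        for ch in p_text:
            # Cross out the first matching character in an overlapping truth box.
            for (g_box, _), chars in zip(gt, remaining):
                if ch in chars and overlaps(p_box, g_box):
                    chars.remove(ch)
                    popped += 1
                    break
    total_gt = sum(len(text) for _, text in gt)
    total_pred = sum(len(text) for _, text in pred)
    recall = popped / total_gt if total_gt else 0.0
    precision = popped / total_pred if total_pred else 0.0
    return recall, precision


# One ground-truth word that the detector split into two boxes.
gt = [((0, 0, 100, 20), "HELLO")]
split_pred = [((0, 0, 55, 20), "HEL"), ((55, 0, 100, 20), "LO")]
split_scores = pop_eval(gt, split_pred)

# A single box with one misread character.
typo_scores = pop_eval(gt, [((0, 0, 100, 20), "HELLA")])
print(split_scores, typo_scores)
```

In the split case every character is still crossed out, so the prediction gets full credit, whereas a box-matching scheme with an IoU threshold would reject at least one of the two fragments.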
Improving OCR quality at DataHunt
Preprocessing is key to improving the accuracy of OCR tasks. DataHunt has created a structure for AI and humans to collaborate in this process.
DataHunt was tasked with cleaning 10,000 receipt images at the request of KEPCO. After cleaning the raw data according to the data construction guidelines, we labeled text regions with bounding boxes or polygons depending on the slant of the letters. After that, we transcribed the text in the labeled areas and classified the text to extract metadata.
By introducing artificial intelligence into the preprocessing for OCR tasks, we were able to see significant improvements in not only accuracy, but also time. On average, we were able to reduce the time spent labeling data by 40%.
Conclusion: You can improve the quality of your data extraction by incorporating AI into your OCR system.
When training an AI with OCRed data, the density of the bounding boxes is of utmost importance. This is because this criterion is important for the model to learn the spatial relationships between text characters and words. A high density of bounding boxes means that the boxes are closely spaced, which allows the AI to learn that these characters and words are likely to be close to each other in the source document. This information can be used to improve the AI's OCR accuracy.
For example, if the density of the BBOX is low, the AI might not learn that 'A' and 'B' are likely to be close together in the original document. Of course, if the density is too high, the AI is likely to learn that all letters and words are always close to each other, so it might split the word 'THE' into 'T' and 'HE' in the process of learning.
Therefore, it is important to find a balanced bounding box density when training your model with OCR-processed data. To do this, apply a bounding box density that is appropriate for the type of document you are processing, and make sure the text that needs to be marked is annotated well. It's also important to avoid overgeneralization by using a variety of bounding box sizes and fonts. When DataHunt used AI for OCR, we also manually corrected the slope to make the bounding boxes denser.
At DataHunt, we follow the human in the loop principle, adopting methods that most closely mimic the way humans make assessments. At the same time, we strive to improve OCR quality by using AI to reduce task time and improve accuracy. OCR is the first step in building business automation, and it should be a top priority. It's an important mission, and we're here to show you how we're researching and vetting it to the best of our ability.
- Optical character recognition (OCR) is a branch of computer vision technology that converts text images into a computer-readable text format. It can be used to convert image files for which a text editor is not available into text documents, making their contents editable, searchable, and word-countable.
- OCR is a combination of text detection model and text recognition model, which combines classification, detection, and segmentation techniques. Recently, end-to-end models that do both at the same time have been developed. In addition, recognition rates have increased significantly by utilizing algorithms based on machine learning, and accuracy has improved by applying deep learning.
- Among the methods to improve OCR data quality, the PopEval evaluation method is effective in solving problems such as split and merge. DataHunt has created a structure where AI and humans collaborate, built mainly on AI-assisted technology, and human intervention in the image preprocessing process has reduced work time and improved accuracy.
- [1904.01941] Character Region Awareness for Text Detection
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks