Overview | Why you need a deep learning-based OCR model
KEPCO worked with DataHunt on an OCR challenge last year. DataHunt improved the quality of data processing by using bounding box technology and delivered excellent results by improving work efficiency through artificial intelligence.
Optical Character Recognition (OCR) is a technology for recognizing and extracting text from photos, scanned documents, videos, and more. It acts as an optical character reader, converting characters visible to the human eye into a data format that a computer can read.
OCR models are having a social impact in a variety of ways.
- By converting written text into a digital format, OCR can improve accessibility for individuals with reading difficulties.
- By automating the process of digitizing recorded information, OCR enables organizations to process information more efficiently in industries such as healthcare and finance, where large amounts of data are processed daily.
- By converting documents into a digital format, OCR plays a key role in preserving valuable cultural artifacts and preventing their deterioration and loss.
Problem | Building RPA with OCR training data
Robotic Process Automation (RPA) refers to technology that automates repetitive tasks performed by humans. Software robots can be used to perform tasks such as data entry, data extraction, and data processing. RPA is often used in industries such as finance, healthcare, and manufacturing, where many tasks still require human intervention, and it allows organizations to improve efficiency, accuracy, and productivity.
By combining OCR and RPA, businesses can automate data entry and data processing tasks associated with paper documents such as invoices, receipts, and forms. By automating the manual process of entering paper data into digital systems, businesses can reduce errors and increase efficiency, especially as it reduces the cost and resources spent on manual labor.
KEPCO wanted to expand the scope of its automation efforts by combining its OCR model with an RPA system. To do so, it needed to build data for model training to improve the performance of its OCR model, and extract object name information to feed into the RPA system.
The ultimate goal was to build an intelligent automation system through the combination of deep learning technology and RPA.
The key to any OCR task is preprocessing. The performance of the model itself is important, but if the preprocessing is of low quality, accuracy will suffer. The OCR model handles each piece of text as either a bounding box or a polygon, and polygons are the more difficult of the two. Slanted characters are labeled less accurately with polygons and transcribed less accurately than upright characters.
Therefore, it was important to organize high-quality OCR data and improve work efficiency.
Solution | Multi-label classification with OCR
The data for OCR training requested by KEPCO was 10,000 images of collected receipts.
DataHunt cleaned the source data according to the data construction guidelines and used a bounding box or a polygon depending on the slope of the text. After that, we transcribed the text in each labeled area and classified the text for metadata extraction.
Establishing data construction guidelines
The unit of work for the output was the word, delimited by spaces. The target characters were English, Korean, and special characters, and we analyzed all the letters written on the receipt, including handwriting. Depending on the slant of the letters, we used an appropriate mix of bounding boxes and polygons.
For accuracy, we also reviewed the work guidelines closely. They are listed below.
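The choice between a bounding box and a polygon can be sketched as a simple rule on the slant of a text region. This is only an illustration, not DataHunt's actual tooling: the four-corner input format, the function names, and the 5-degree threshold are all assumptions, since the guidelines do not state a cutoff.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

# Hypothetical cutoff in degrees; the guidelines do not state a number.
SLANT_THRESHOLD_DEG = 5.0


def slant_degrees(quad: List[Point]) -> float:
    """Angle of the top edge (first corner to second corner) from horizontal."""
    (x0, y0), (x1, y1) = quad[0], quad[1]
    angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
    # Fold into [-90, 90] so the direction of the edge does not matter.
    if angle > 90:
        angle -= 180
    elif angle < -90:
        angle += 180
    return angle


def choose_shape(quad: List[Point]) -> dict:
    """Return a bounding box for near-upright text, else keep the polygon."""
    if abs(slant_degrees(quad)) <= SLANT_THRESHOLD_DEG:
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        return {"type": "bbox", "coords": (min(xs), min(ys), max(xs), max(ys))}
    return {"type": "polygon", "coords": list(quad)}


# Corners are given clockwise starting from the top-left of the text.
upright = [(10, 10), (110, 11), (110, 31), (10, 30)]
slanted = [(10, 10), (100, 40), (92, 62), (2, 32)]
print(choose_shape(upright)["type"])  # bbox
print(choose_shape(slanted)["type"])  # polygon
```

A rule like this keeps the cheaper bounding box wherever it fits the text tightly and reserves polygons for regions where a box would include too much background.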
- Based on the transcription: 6 object names and 2 other classes were categorized and labeled with their respective attributes.
- Object names: merchant name, transaction date, transaction time, total amount, card number, authorization number
- Other classes: tear line, signature
- Bounding box / polygon labeling criteria: all text, including symbols, is split at spaces and each piece is labeled separately.
- However, any symbol other than the special symbols that can be typed on a keyboard (31 types) is transcribed as '$'.
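The symbol rule in the guidelines can be illustrated with a small normalization function. This is a sketch under stated assumptions: the guideline's 31 keyboard symbols are not listed here, so Python's `string.punctuation` (which has 32 characters) stands in for them, and the function names are hypothetical.

```python
import string

# Assumption: ASCII punctuation stands in for the keyboard-enterable special
# symbols. The guideline counts 31 types but does not list them, and
# string.punctuation has 32 characters, so the real set differs slightly.
KEYBOARD_SYMBOLS = set(string.punctuation)


def is_hangul(ch: str) -> bool:
    """True for precomposed Hangul syllables and compatibility jamo."""
    return "\uac00" <= ch <= "\ud7a3" or "\u3131" <= ch <= "\u318e"


def normalize_token(token: str) -> str:
    """Replace every character outside the allowed sets with '$'.

    Tokens are assumed to be already split at spaces, so whitespace
    is not expected inside a token.
    """
    out = []
    for ch in token:
        allowed = (
            (ch.isascii() and ch.isalnum())
            or is_hangul(ch)
            or ch in KEYBOARD_SYMBOLS
        )
        out.append(ch if allowed else "$")
    return "".join(out)


print(normalize_token("합계₩12,000"))  # 합계$12,000
print(normalize_token("No.★15"))  # No.$15
```

Mapping every out-of-set symbol to a single placeholder keeps the transcription vocabulary small and consistent across workers.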
Finally, the object name classes for metadata extraction can be categorized as follows.
- Payment merchant name
- Payment transaction date
- Payment authorization time
- Payment total amount
- Payment card number
- Authorization number of the payment
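One way to picture how these six classes feed metadata extraction is the sketch below. The enum values, the `LabeledWord` type, and the sample receipt words are invented for illustration and are not taken from the project's data or from the interface of KEPCO's RPA system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class ObjectName(Enum):
    """The six object-name classes listed above (identifier names are assumed)."""
    MERCHANT_NAME = "merchant_name"
    TRANSACTION_DATE = "transaction_date"
    AUTHORIZATION_TIME = "authorization_time"
    TOTAL_AMOUNT = "total_amount"
    CARD_NUMBER = "card_number"
    AUTHORIZATION_NUMBER = "authorization_number"


@dataclass
class LabeledWord:
    text: str
    # None marks a transcribed word that belongs to none of the six classes.
    object_name: Optional[ObjectName] = None


def extract_metadata(words: List[LabeledWord]) -> Dict[str, str]:
    """Group transcribed words by class, joined in reading order."""
    fields: Dict[str, List[str]] = {}
    for w in words:
        if w.object_name is not None:
            fields.setdefault(w.object_name.value, []).append(w.text)
    return {name: " ".join(parts) for name, parts in fields.items()}


# Invented sample words, not real receipt data.
words = [
    LabeledWord("Sample", ObjectName.MERCHANT_NAME),
    LabeledWord("Mart", ObjectName.MERCHANT_NAME),
    LabeledWord("2023-01-05", ObjectName.TRANSACTION_DATE),
    LabeledWord("Thank"),
    LabeledWord("15,000", ObjectName.TOTAL_AMOUNT),
]
print(extract_metadata(words))
```

Because words are labeled one by one, a multi-word field such as a merchant name is reassembled by joining its words with spaces.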
Refining the source data based on the scope of the work
The improvement work on skewed text described above took a different approach. Rather than letting AI do all the work, it required an intermediate step in which skewed images were detected and then rotated manually by a human operator. This allowed the AI to create more accurate labels. Why did DataHunt take this approach? It was purely to improve the quality of our OCR data.
Q. Couldn't you just do it without having to manually rotate the images?
A. When we looked at different methods, we found it difficult to rotate most of the letters automatically so that they stood upright.
We decided that the extra time spent rotating images, relative to the time spent on OCR itself, was worth it to raise the quality of the OCR AI above subpar.
Q. Couldn't we just train the model from the beginning with all the skewed text?
A. The OCR model is already trained with a lot of skewed text, but a limitation of the AI model is that polygons are harder for it to learn than bounding boxes.
The project required an average of 50-70 objects per image, which is relatively time-consuming and labor-intensive. We decided to train and engage more than 100 crowdsourced workers to keep the project on track.
However, this process consumed a relatively large amount of resources. In order to meet the deadline, we applied the Text Detection feature, which efficiently reduced the time spent on text transcription.
Data check / Admin cross-check after receipt data processing work
After all the work processes are completed, the deliverables are checked. DataHunt verifies that the work in the deliverables is fully complete and accurate. We perform a "full check" to ensure that the work has been completed according to the guidelines.
- Checks for typos, spacing, and categorization of entities.
- Checks that bounding boxes and polygons leave no excess margin, given the text slope.
- If there's an unworkable type, find out why and make sure it's actually unworkable.
- Make sure there are no major/minor bugs or defects in the QA process.
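Some of the checks above lend themselves to automation before a human reviewer looks at the data. The following is a minimal sketch of such a pre-check, not DataHunt's QA tooling: the dictionary format, the class identifiers, and the function names are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed identifiers for the six object names and two other classes.
# None stands for a transcribed word with no class.
ALLOWED_CLASSES = {
    "merchant_name", "transaction_date", "authorization_time",
    "total_amount", "card_number", "authorization_number",
    "tear_line", "signature", None,
}
# Classes assumed to carry no transcription.
NON_TEXT_CLASSES = {"tear_line", "signature"}


def check_annotation(ann: dict) -> List[str]:
    """Return a list of guideline problems found in one annotation."""
    problems = []
    label = ann.get("label")
    text = ann.get("text", "")
    if label not in ALLOWED_CLASSES:
        problems.append("unknown class")
    if label not in NON_TEXT_CLASSES:
        if not text:
            problems.append("empty transcription")
        elif any(ch.isspace() for ch in text):
            # Words are split at spaces, so a label should hold one word.
            problems.append("contains whitespace")
    return problems


def full_check(annotations: List[dict]) -> Dict[int, List[str]]:
    """Check every annotation and report problems by index."""
    report = {}
    for i, ann in enumerate(annotations):
        problems = check_annotation(ann)
        if problems:
            report[i] = problems
    return report


annotations = [
    {"text": "15,000", "label": "total_amount"},
    {"text": "Sample Mart", "label": "merchant_name"},
    {"text": "1234", "label": "card_no"},
    {"text": "", "label": "signature"},
]
print(full_check(annotations))
```

A pre-check like this catches only mechanical problems such as spacing and invalid classes. Typos and margin quality still need the human cross-check described above.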
Given the type of work, there is a high probability of human error. However, we collaborated with the AI team to pre-process the data with typo detection, which reduced the resources needed to maintain data quality and helped us meet the delivery schedule.
Result | OCR training data with 99.995% accuracy utilizing bounding boxes
After the client's inspection, DataHunt was thrilled to have crossed another milestone: the error rate was only 0.45% during the sample inspection. Compared to the existing open data, the accuracy was 10% higher.
KEPCO collaborated with DataHunt to build an RPA system with technology that automatically extracts characters from documents, freeing its staff from the time-consuming repetitive tasks they had to perform.
Using AI for OCR also improved DataHunt's work time. On average, we've been able to reduce the time it takes to do OCR work by about 40%, which is a significant change: from working on one image to working on nearly two (about 1.7) in the same amount of time.
Less time, more accuracy
OCR work, in particular, is prone to human error. However, DataHunt was able to achieve high quality and meet their deadline because they didn't rely solely on technology, nor did they rely solely on humans. They brought in more than 100 crowdsourced workers to do the labeling manually, but used AI as a way to reduce their resources in the process.
Ultimately, the reason why DataHunt was able to successfully complete this project is simple. We understood the project from the outset and designed the right mix of AI and humans, and we thoroughly checked for possible human errors. KEPCO also saw great results from applying RPA to their OCR model.