Hi, today I'm going to show you how we at DataHunt have used AI and our own algorithms to work around the limitations of BERT-based Named Entity Recognition (NER) models, which come up often in text annotation work.
NER is a technique that recognizes and classifies words (entity names) in a document that correspond to predefined categories such as people, companies, places, times, and units. The entity types extracted here, such as person, location, organization, and time, are not fixed; the classification scheme can differ depending on the task.
For example, in the sentence
DataHunt moved its office from Gangnam Station to the vicinity of Seomyeonneung in 2022
DataHunt can be classified as an organization, 2022 as a time, and Gangnam Station and Seomyeonneung as places. The tags alone do not convey the exact meaning of the sentence, but they give you a good sense of what it is about. Classifying entity names accurately like this is very helpful for extracting core information.
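In practice, NER output like this is commonly represented with BIO tags, where `B-` marks the first token of an entity span, `I-` a continuation, and `O` a non-entity token. The following is a minimal sketch of this representation applied to the example sentence; the label names (`ORG`, `LOC`, `TIME`) are illustrative, not a fixed standard.

```python
# BIO-tagged version of the example sentence (labels are illustrative).
tokens = ["DataHunt", "moved", "its", "office", "from", "Gangnam", "Station",
          "to", "the", "vicinity", "of", "Seomyeonneung", "in", "2022"]
tags = ["B-ORG", "O", "O", "O", "O", "B-LOC", "I-LOC",
        "O", "O", "O", "O", "B-LOC", "O", "B-TIME"]

def extract_entities(tokens, tags):
    """Collect (entity_text, label) pairs from parallel BIO-tagged tokens."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # start of a new entity span
            if current:
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)           # continuation of the current span
        else:                             # "O": flush any open span
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(extract_entities(tokens, tags))
# → [('DataHunt', 'ORG'), ('Gangnam Station', 'LOC'), ('Seomyeonneung', 'LOC'), ('2022', 'TIME')]
```

Note how the multi-token span "Gangnam Station" is recovered from a `B-LOC` followed by an `I-LOC`, which is exactly what the BIO scheme exists to express.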
It is common to train an NER model on processed data and then use the trained model to help process further NER datasets. However, this approach has some issues, so let's look at some common NER models, what those issues are, and how we at DataHunt have streamlined our NER dataset processing.
The most famous NER model is the Bidirectional Encoder Representations from Transformers (BERT) model.
While the world is buzzing about GPT these days, BERT remains an indispensable model for natural language processing, with the following characteristics:
With a pre-trained BERT model, you can achieve good performance with less data and fewer computing resources.
However, BERT is not a perfect model either, for various reasons including:
Training the BERT model on more data can mitigate the problems above somewhat, but then we have to retrain every time and depend on the BERT model for all results. So as the work progressed, we started thinking about alternatives to continuously retraining the model.
To address this, we created a new pre-labeling algorithm based on statistics accumulated in a database of previously labeled entity names. The process is intuitive and interactive, and its advantage is that it can be applied immediately, without any training step. The overall process, including the BERT-based NER model, is shown below.
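The details of DataHunt's algorithm are not public, but the core idea the paragraph describes can be sketched as: accumulate label counts per surface form from completed annotations, then pre-label new text by looking each token up and emitting its majority label with a frequency-based confidence. All names and data below are illustrative, not DataHunt's actual implementation.

```python
from collections import Counter, defaultdict

# Accumulated database: surface form -> counts of labels received in past work.
entity_db = defaultdict(Counter)

def accumulate(labeled_examples):
    """Fold completed (word, label) annotations into the statistics database."""
    for word, label in labeled_examples:
        entity_db[word][label] += 1

def pre_label(tokens):
    """Suggest (token, label, confidence) for tokens seen before.

    Confidence is the share of the majority label among all labels
    this surface form has received so far.
    """
    suggestions = []
    for tok in tokens:
        counts = entity_db.get(tok)
        if counts:
            label, n = counts.most_common(1)[0]
            suggestions.append((tok, label, n / sum(counts.values())))
    return suggestions

accumulate([("DataHunt", "ORG"), ("DataHunt", "ORG"),
            ("Seomyeonneung", "LOC"), ("2022", "TIME")])
print(pre_label(["DataHunt", "moved", "to", "Seomyeonneung"]))
# → [('DataHunt', 'ORG', 1.0), ('Seomyeonneung', 'LOC', 1.0)]
```

Because this is a lookup rather than a model, it matches the properties the text claims: it needs no training run, and every new batch of finished annotations immediately sharpens the next batch of suggestions.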
Our proprietary confidence assessment algorithm has the following advantages:
Through this process, we delivered as many pre-labels as possible to the annotators. Of course, pre-labels based on a confidence assessment algorithm face a quantity-accuracy trade-off, so we made the behavior flexible enough to change from project to project.
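The quantity-accuracy trade-off can be made concrete with a per-project confidence threshold: a low threshold delivers more pre-labels that may need correction, a high one delivers fewer but more reliable ones. A minimal sketch, with illustrative confidence values and thresholds:

```python
def filter_pre_labels(suggestions, threshold):
    """Keep only (word, label) suggestions whose confidence clears the
    project's threshold; suggestions are (word, label, confidence) triples."""
    return [(w, lbl) for w, lbl, conf in suggestions if conf >= threshold]

suggestions = [("DataHunt", "ORG", 0.95),
               ("Gangnam", "LOC", 0.60),
               ("2022", "TIME", 0.85)]

# Lenient project: more pre-labels, some may need correction by annotators.
print(filter_pre_labels(suggestions, threshold=0.5))  # keeps all three
# Strict project: fewer pre-labels, higher expected accuracy.
print(filter_pre_labels(suggestions, threshold=0.8))  # drops ('Gangnam', 'LOC')
```

Tuning a single threshold like this is what makes the same pipeline usable both for projects that prize annotator speed and for those that prize pre-label precision.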
An example of an improved result from the developed confidence assessment algorithm is shown below.
In this case, the algorithm nicely complemented entity names that the NER model alone could not detect, and because the entity (word) had appeared in previously annotated text, it improved work efficiency. It is a simple and effective method that, like a deep learning model, gives more accurate results as more data accumulates. However, the fact that it tags the same word with the same entity name regardless of context still needs improvement, and, as with model training, we had to account for the data needing to accumulate to a certain level before the confidence assessment becomes valid.
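The complementary relationship described above, with the database catching entities the model missed, amounts to a merge of two suggestion sources. One simple policy, sketched here as an assumption rather than DataHunt's actual rule, is to let model predictions win on conflicts while the dictionary fills in the gaps:

```python
def merge_pre_labels(model_labels, dict_labels):
    """Combine NER-model predictions with dictionary suggestions.

    Both inputs map word -> entity label. Model predictions override
    the dictionary on overlap; the dictionary covers missed words.
    """
    merged = dict(dict_labels)
    merged.update(model_labels)  # model output takes priority on conflicts
    return merged

model_labels = {"DataHunt": "ORG", "2022": "TIME"}          # model detections
dict_labels = {"Seomyeonneung": "LOC", "DataHunt": "ORG"}   # database hits

print(merge_pre_labels(model_labels, dict_labels))
# → {'Seomyeonneung': 'LOC', 'DataHunt': 'ORG', '2022': 'TIME'}
```

Here the dictionary contributes "Seomyeonneung", which the model missed, while the model's own labels are left untouched, matching the complement effect the text describes.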
In 2022, DataHunt worked with Boston Consulting Group (BCG) on a project to build a dataset for NER model development. It was a challenging project, with hundreds of thousands of entity-name tags and many words that needed to be categorized as different entity names depending on context. Because the data was substantial, the project duration was not short, which left plenty of time to refine both the NER model training and the confidence algorithm. This allowed the pre-labeling to improve gradually, which greatly helped the annotators' efficiency.
In this way, DataHunt uses human-in-the-loop AI to streamline data collection and processing and to build accurate datasets in various fields, including NER.