Hi, today I'm going to show you how we at DataHunt have used AI and our own algorithms to work around the limitations of BERT-based Named Entity Recognition (NER) models, which come up often in text annotation work.
What is NER?
NER is an extractive classification technique that recognizes words (entity names) in documents corresponding to predefined categories such as people, companies, places, times, and units. Typical entity types are person, location, organization, and time, but these are not mandatory; the classification scheme can differ depending on the task.
For example, consider a sentence mentioning DataHunt, the year 2022, Gangnam Station, and Seomyeonneung. DataHunt can be classified as an organization, 2022 as a time, and Gangnam Station and Seomyeonneung as place names. The extracted entity names do not convey the exact meaning of the sentence, but they give a sense of what it is about. Accurate entity classification like this is very helpful for extracting core information.
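Under the hood, such token-level labels are commonly written in the BIO (Begin / Inside / Outside) scheme. Below is a minimal sketch in plain Python; the sentence and entity spans are constructed for illustration, not produced by a real model:

```python
# BIO tagging for an illustrative sentence (spans are hard-coded
# here as an assumption for the example, not model output).
tokens = ["DataHunt", "opened", "an", "office", "near",
          "Gangnam", "Station", "in", "2022", "."]

# (start, end, label) spans over the token list
spans = [(0, 1, "ORG"), (5, 7, "LOC"), (8, 9, "TIME")]

def to_bio(tokens, spans):
    """Convert entity spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

print(list(zip(tokens, to_bio(tokens, spans))))
```

Each token gets exactly one tag, which is what a token-classification model like BERT is trained to predict.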
A common workflow is to train an NER model on processed data and then use the trained model to help process further NER datasets. However, this approach has some issues, so let's look at common NER models, what those issues are, and how we at DataHunt have streamlined our NER dataset processing.
Existing models and issues
The most famous NER model is the Bidirectional Encoder Representations from Transformers (BERT) model.
While the world is buzzing about GPT these days, BERT remains an indispensable model for natural language processing, with the following features:
- Transformer architecture: a self-attention-based structure learns to understand a wide range of relationships between words
- Bidirectional training: BERT reads input sentences in both directions, so each word is understood in the context of both what precedes and what follows it
- Pre-training / fine-tuning: BERT is pre-trained on a large text corpus. During pre-training it performs two tasks, through which it acquires a general understanding of language:
- Masked Language Modeling
- Next Sentence Prediction
- Transfer learning: one of BERT's biggest advantages is that it supports transfer learning. Starting from a pre-trained BERT model, you can achieve good performance with less data and fewer computing resources.
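The masked language modeling task mentioned above can be sketched in a few lines. This is a simplified illustration in plain Python (real BERT masks about 15% of tokens, and of those replaces some with random tokens or keeps them unchanged, which is omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Simplified masked-language-modeling input preparation:
    replace ~15% of tokens with [MASK] and record the originals
    as prediction targets. (BERT's full recipe also sometimes
    substitutes a random token or keeps the original.)"""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model must predict this token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the model learns language from context".split())
print(masked, targets)
```

During pre-training, the model sees the masked sequence and is optimized to recover the original tokens at the masked positions.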
However, BERT is not a perfect model either, for several reasons:
- The tokenizer splits text into subword units, but the segmentation is not perfect
- Even when the text is segmented well, the model's entity classification accuracy is not perfect
- The same word can be a different entity type depending on the context
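The first issue deserves a concrete illustration: when a subword tokenizer splits a word into pieces, the word-level NER labels must be realigned to those pieces. A common convention labels only the first piece and ignores the rest during training. The splits below are assumptions for the example, not the output of a real WordPiece tokenizer:

```python
# Why subword tokenization complicates NER labelling.
# The splits below are assumed for illustration.
word_pieces = {
    "Gangnam": ["Gang", "##nam"],   # one word becomes two pieces
    "Station": ["Station"],
    "near": ["near"],
}

def align_labels(words, labels):
    """Expand word-level BIO labels to subword pieces, labelling only
    the first piece of each word and marking continuation pieces as
    "IGN" (to be ignored when computing the training loss)."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        for j, piece in enumerate(word_pieces.get(word, [word])):
            pieces.append(piece)
            piece_labels.append(label if j == 0 else "IGN")
    return pieces, piece_labels

print(align_labels(["near", "Gangnam", "Station"], ["O", "B-LOC", "I-LOC"]))
```

If the tokenizer splits a word badly, the entity boundary can no longer be expressed cleanly in the label sequence, which is one source of the accuracy problems listed above.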
Training the BERT model on more data mitigates these problems somewhat, but then we must retrain continually and rely on the BERT model for every result. So as the work progressed, we looked for an alternative to continuously retraining the model.
DB-based confidence evaluation algorithm
So we built a new pre-labeling algorithm based on statistics from a database of previously annotated entity names. The process is intuitive and interactive, and its key advantage is that it can be applied immediately, without any training step. The overall process, including the BERT-based NER model, is as follows:
- Pass pre-labels to workers via a BERT-based NER model
- The worker receives the pre-labels and performs the task; in this process, NER tags are created, modified, or deleted
- Accumulate the annotated entity names in a DB (building a word dictionary) and score each name's confidence with the designed algorithm
- Pre-label names that exceed the confidence threshold in future work
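The accumulate-and-threshold loop above might be sketched as follows. This is a minimal illustration: the ratio-based confidence score and the `EntityDictionary` class are assumptions for the example, since DataHunt's actual formula is not described here:

```python
from collections import defaultdict

class EntityDictionary:
    """Sketch of a DB-backed word dictionary: accumulate annotated
    (word, label) pairs and pre-label words whose dominant label
    exceeds a confidence threshold. The ratio-based score is an
    assumption for illustration, not DataHunt's actual formula."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, word, label):
        """Accumulate one worker annotation (the DB step)."""
        self.counts[word][label] += 1

    def confidence(self, word):
        """Most frequent label for `word` and its share of annotations."""
        labels = self.counts[word]
        total = sum(labels.values())
        if total == 0:
            return None, 0.0
        best = max(labels, key=labels.get)
        return best, labels[best] / total

    def prelabel(self, tokens):
        """Pre-label only tokens whose confidence clears the threshold."""
        out = []
        for tok in tokens:
            label, conf = self.confidence(tok)
            out.append((tok, label if conf >= self.threshold else None))
        return out

d = EntityDictionary(threshold=0.8)
for _ in range(9):
    d.record("DataHunt", "ORG")
d.record("DataHunt", "LOC")   # one conflicting annotation
print(d.prelabel(["DataHunt", "unseen-word"]))
```

Because the dictionary is just accumulated counts, it updates in real time as workers annotate, with no training step in between.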
Our proprietary confidence evaluation algorithm has the following advantages:
- Works in real time, without model training
- Lets us intervene directly in the logic and analyze the cause of incorrect pre-labels
- Unaffected by tokenizer segmentation errors
Through this process we delivered as many pre-labels as possible to the workers. Of course, confidence-based pre-labeling faces a quantity-accuracy trade-off, so we made the threshold flexible enough to change per project.
Improvement results from the evaluation algorithm
An example of an improved result from the developed confidence assessment algorithm is shown below.
In this case, the algorithm complemented entities that the NER model alone could not detect: because the entity (word) appeared in previously annotated text, work efficiency improved. Like a deep learning model, this simple and effective method becomes more accurate as more data accumulates. However, cases where the same word should receive different entity tags still need improvement, and, as with model training, we had to account for the fact that data must accumulate to a certain level before the confidence evaluation becomes valid.
Our algorithm in action
In 2022, DataHunt worked with Boston Consulting Group (BCG) on a project to build a dataset for NER model development. It was a challenging project, with hundreds of thousands of entity tags and many words that needed different entity labels depending on context. Because the dataset was large, the project ran long enough for both the NER model and the confidence algorithm to be refined. This allowed the pre-labeling to improve gradually, which greatly helped the workers' efficiency.
In this way, DataHunt uses human-in-the-loop AI to streamline data collection and processing and to build accurate datasets in many fields, including NER.