Active Learning Accelerates Your Data Sampling

How efficiency correlates to data

Suho Cho
What is Active Learning?

In the context of AI training, active learning is the process of selecting, from unlabeled data, the samples that are likely to be most effective for training a model. Here, "effective" means the following: if the expected performance of a well-trained model on your entire dataset is 100, you might ask "How can I get to 100 faster?" or "Can I get more than that?" in terms of sampling speed.

You might object: "From the AI model's point of view, these are problems (data) it hasn't even studied yet, so how does it know which ones will be more helpful?" In fact, humans often know this intuitively: which problems are more difficult, and which chapters would matter most if studied now.


Why active learning?

At Datahunt, we use an AI model to do the pre-labeling, and then a human operator corrects the work. To illustrate this, let's set up a situation.

I (the human worker) have injured my hand, so it takes me a long time to solve problems by myself. I asked my friend (the AI) to prepare (pre-label) a cheat sheet for me ahead of a school exam. The better my friend studies, the fewer problems I have to solve myself and the better my exam grade will be. So, being a good student, I decided to tutor my friend.


Cheat Sheet Version 1

  1. [Me] First, pick a random problem (data), or a small set of them, and solve it myself.
  2. [My friend] Studies the problems I solved (model training).
  3. [My friend] Creates a cheat sheet (pre-label) based on what he has studied.
  4. [Me] Take the test using the cheat sheet my friend (the AI) handed me.


At first glance this sounds great, but the result is likely to be a not-so-smart model trained on a small initial amount of data. It's hard to trust a cheat sheet from a friend who hasn't studied properly. It's more likely to confuse me.

So I decided to tutor my friend periodically. It's hard to stay with someone 24 hours a day to help them study, so I'll do it at a reasonable interval.


Cheat Sheet Version 2

  • [Me] First, pick a random problem (data), or a small set of them, and solve it myself.
  • Repeat the following:
    • [My friend] Studies the problems I solved (model training).
    • [My friend] Creates a cheat sheet (pre-label) based on what he has studied.
    • [Me] Solve a small number of problems (data) on my own, using the cheat sheet as a reference.
  • [Me] Take the test using the cheat sheet my friend (the AI) handed me.


The more we work, the more the model learns. If we hold a lesson once a day, we can build a model that gets smarter every day, and the cheat sheets we get will come closer and closer to the right answers and become more reliable.

However, even after studying everything once, my friend is unlikely to be equally skilled at every topic: some things are understood well, others are not. So it would be more effective to focus on the topics my friend doesn't know well.


Cheat Sheet Version 3

  • [Me] First, pick a random problem (data), or a small set of them, and solve it myself.
  • Repeat the following:
    • [My friend] Studies the problems I solved (model training).
    • [My friend] Creates a cheat sheet (pre-label) based on what he has studied.
    • [My friend] While creating the cheat sheet, picks the problems he thinks need more work (sampling).
    • [Me] Solve the problems (data) my friend selected.
  • [Me] Take the test using the cheat sheet my friend (the AI) handed me.
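
The Version 3 routine can be sketched as a loop. This is a minimal illustration under stated assumptions, not the implementation used in production: `label`, `train`, and `uncertainty` are hypothetical callables standing in for the human worker, model training, and the friend's judgment of which problems need more work. The pre-labeling step is left out to keep the sketch short.

```python
import random
from typing import Callable, Dict, Hashable, List, Tuple


def active_learning_loop(
    pool: List[Hashable],
    label: Callable[[Hashable], object],                # hypothetical: human worker
    train: Callable[[Dict[Hashable, object]], object],  # hypothetical: model training
    uncertainty: Callable[[object, Hashable], float],   # hypothetical: model's doubt
    initial_size: int,
    per_round: int,
    rounds: int,
    seed: int = 0,
) -> Tuple[object, Dict[Hashable, object]]:
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # [Me] Solve a small random set of problems first.
    labeled = {item: label(item) for item in unlabeled[:initial_size]}
    unlabeled = unlabeled[initial_size:]

    for _ in range(rounds):
        if not unlabeled:
            break
        # [My friend] Study what has been solved so far (model training).
        model = train(labeled)
        # [My friend] Pick the problems that seem to need more work (sampling).
        unlabeled.sort(key=lambda item: uncertainty(model, item), reverse=True)
        picked, unlabeled = unlabeled[:per_round], unlabeled[per_round:]
        # [Me] Solve only the problems my friend picked.
        for item in picked:
            labeled[item] = label(item)

    # One last study session before the test.
    return train(labeled), labeled
```

The three knobs `initial_size`, `per_round`, and `rounds` are illustrative names for the initial amount of learning, the additional amount per cycle, and the number of cycles.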


I'm so proud of my friend for identifying his weaknesses and studying more. I'm looking forward to taking the test with the cheat sheet he created and seeing if I can get a perfect score.


Active Learning in a nutshell

Here's an illustration of this process in Active Learning.

[Figure: Active Learning structure]


In this diagram, there are several elements of active learning.


  • Initial amount of learning
  • Learning Cycle/Additional Learning Amount
  • Sampling methods


Let's take a look at the most researched and most important of these, the sampling method, and at how we at DataHunt have used Active Learning to speed up our sampling efforts.

There are actually several different methodologies within Active Learning.

They include Membership Query Synthesis (where you create the data you want the current model to learn), Stream-based Selective Sampling (where you look at the data as it comes in sequentially and decide what to learn), and Pool-based Sampling (where you look at all the data you have and decide what to learn first). Today we're going to talk about Pool-based Sampling, and I'll explain why below.
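
To make the difference between the last two concrete, here is a small sketch. The function names, the `threshold` parameter, and the `uncertainty` callable are all hypothetical and chosen only for illustration. Membership Query Synthesis is left out because it generates new data rather than selecting existing data.

```python
from typing import Callable, Hashable, Iterable, Iterator, List


def stream_based_select(
    stream: Iterable[Hashable],
    uncertainty: Callable[[Hashable], float],
    threshold: float,
) -> Iterator[Hashable]:
    """Decide item by item, as data arrives, whether to ask for a label."""
    for item in stream:
        if uncertainty(item) >= threshold:
            yield item


def pool_based_select(
    pool: Iterable[Hashable],
    uncertainty: Callable[[Hashable], float],
    k: int,
) -> List[Hashable]:
    """Score the whole pool first, then take the k most uncertain items."""
    return sorted(pool, key=uncertainty, reverse=True)[:k]
```

The stream-based version never sees the whole dataset, so it needs a fixed cut-off; the pool-based version can compare every item against every other and simply take the top of the ranking.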


Sampling methods

At DataHunt, we usually apply pool-based sampling, because we typically have the full dataset available from the start. Pool-based methods are also the most widely studied for real-world services, where a lot of data is often on hand at the outset.

There are a number of ways to do this, but Uncertainty Sampling is gaining traction because it is easy to implement, broadly applicable, and fast at sampling. (Speed matters: labeling work continues even while the model is idle, so pre-labels need to be passed along as quickly as possible.)

Uncertainty Sampling is exactly what it sounds like: a way to pick data based on uncertainty. The premise is that the more confused the model is about the answer, the harder the problem is. There are three main ways to measure uncertainty, the best known being an entropy-based calculation called Maximum Entropy.



Let's look at the simplest Classification problem, where the model is classifying 3 classes and the results obtained for each input data are shown below.

[Figure: Classification results]


Least Confident: Choose the data with the lowest probability of the most likely class

  • Taking the most likely class for each candidate and its probability, we get Candidate 1: 0.5 (puppy), Candidate 2: 0.34 (hamster), Candidate 3: 0.4 (puppy or cat).
  • The lowest of these probabilities comes first, so the prioritization is: Candidate 2 > Candidate 3 > Candidate 1
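
As a quick check, Least Confident can be sketched in a few lines. The full probability table is not reproduced in the text, so the values below are inferred from the top-1 probabilities, margins, and entropies quoted in this section; treat them as an assumption.

```python
# Class probabilities per candidate (inferred from the figures quoted in the text).
candidates = {
    "Candidate 1": {"puppy": 0.50, "cat": 0.40, "hamster": 0.10},
    "Candidate 2": {"puppy": 0.33, "cat": 0.33, "hamster": 0.34},
    "Candidate 3": {"puppy": 0.40, "cat": 0.40, "hamster": 0.20},
}


def least_confident_ranking(probs):
    """Order candidates so the lowest top-1 probability comes first."""
    return sorted(probs, key=lambda name: max(probs[name].values()))


print(least_confident_ranking(candidates))
# ['Candidate 2', 'Candidate 3', 'Candidate 1']
```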


Margin Sampling: Selecting data with the smallest difference in probability between the most likely class (Top-1) and the next most likely class (Top-2).

  • Calculating the difference between Top-1 and Top-2 for each candidate, we get Candidate 1: 0.1 (puppy - cat), Candidate 2: 0.01 (hamster - puppy), and Candidate 3: 0.0 (puppy - cat).
  • So the priority is: Candidate 3 > Candidate 2 > Candidate 1
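
Margin Sampling can be sketched the same way. The probabilities are again inferred from the numbers quoted in this section rather than taken from the original figure.

```python
# Class probabilities per candidate (inferred from the figures quoted in the text).
candidates = {
    "Candidate 1": {"puppy": 0.50, "cat": 0.40, "hamster": 0.10},
    "Candidate 2": {"puppy": 0.33, "cat": 0.33, "hamster": 0.34},
    "Candidate 3": {"puppy": 0.40, "cat": 0.40, "hamster": 0.20},
}


def margin(class_probs):
    """Difference between the two highest class probabilities."""
    top1, top2 = sorted(class_probs.values(), reverse=True)[:2]
    return top1 - top2


def margin_ranking(probs):
    """Order candidates so the smallest Top-1/Top-2 margin comes first."""
    return sorted(probs, key=lambda name: margin(probs[name]))


print(margin_ranking(candidates))
# ['Candidate 3', 'Candidate 2', 'Candidate 1']
```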


Maximum Entropy: Select data with the highest entropy

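  • Entropy is calculated as follows, where p_i is the predicted probability of class i (the values below correspond to a base-2 logarithm):

$$H = -\sum_{i} p_i \log_2 p_i$$

  • If we calculate the entropy of each candidate, we get Candidate 1: 1.361, Candidate 2: 1.585, Candidate 3: 1.522
  • Therefore, the prioritization is: Candidate 2 > Candidate 3 > Candidate 1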
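
The entropy values above can be reproduced with a short sketch. The probabilities are inferred from the numbers quoted in this section, and the base-2 logarithm is an assumption that happens to match the quoted entropies.

```python
import math

# Class probabilities per candidate (inferred from the figures quoted in the text).
candidates = {
    "Candidate 1": {"puppy": 0.50, "cat": 0.40, "hamster": 0.10},
    "Candidate 2": {"puppy": 0.33, "cat": 0.33, "hamster": 0.34},
    "Candidate 3": {"puppy": 0.40, "cat": 0.40, "hamster": 0.20},
}


def entropy(class_probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in class_probs.values() if p > 0)


def entropy_ranking(probs):
    """Order candidates so the highest entropy comes first."""
    return sorted(probs, key=lambda name: entropy(probs[name]), reverse=True)


print({name: round(entropy(p), 3) for name, p in candidates.items()})
# {'Candidate 1': 1.361, 'Candidate 2': 1.585, 'Candidate 3': 1.522}
print(entropy_ranking(candidates))
# ['Candidate 2', 'Candidate 3', 'Candidate 1']
```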


You can get different results from each method, but the Maximum Entropy method is generally considered to work best (information theory emphasizes the concept of entropy so heavily for a reason).

In our example we used just three pieces of data, but in practice you have to pick the best ones to learn first from a large dataset of at least 10,000 and as many as 100,000 items. This is a time-consuming process, and there is a reason not to run it over all of the data. Below I'll explain that reason, what we did at DataHunt, and how the experiment turned out.



Datahunt's Active Learning Experiment

A study by Beluch et al. reports that simply selecting the K most uncertain samples from the whole unlabeled set is not very effective. The reason is that when the number or complexity of classes is small relative to the dataset size, the selected samples tend to carry overlapping information. The key, therefore, is to sample from a smaller subset.

  • Sampling from a smaller subset is effective, and that subset should be drawn at random.
  • Pros: less redundant information
  • Cons: harder to find the global optimum

The size of the sample can vary depending on the experiment. In our experiments, we set it to 10% of the total data size.


Experimental Design


  • Model: YOLOv5s
    We went with a lightweight model for a quick experiment
  • Uncertainty calculated on: class probability of each resulting bounding box
    Bounding boxes have coordinates as well as classes, but it was not easy to get the uncertainty of the coordinates without knowing the correct answer.
  • Dataset: Private
    Datahunt's secrets...
  • Initial amount of labeled data: 2% of the total
    We decided on a minimum amount of data to train the initial model - too little and the model has nothing to learn, too much and the workers don't take advantage of the pre-label.
  • Sampling frequency and quantity: 10% of the total data every 20 epochs.
    We chose this level because the frequency of sampling and model retraining affects the actual efficiency of the human workers.
  • Total training volume: 100 epochs of experimentation, with 42% of the data labeled and used for training.
    There is no scenario where all the data is used for training, because that would mean that the labeling work is already done. Before all the data is labeled, the model must be trained and ready to pass pre-label.
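
The 42% figure follows from the other settings: 2% to start, plus 10% at each sampling point. The sketch below assumes sampling happens at epochs 20, 40, 60, and 80 but not at epoch 100, since data added at the final epoch would never be trained on. That assumption is ours, though it is the one that reproduces 42%.

```python
def labeled_percent_schedule(initial, per_round, interval, total_epochs):
    """Return (epoch, labeled percent) pairs for the whole training run."""
    schedule = [(0, initial)]
    percent = initial
    # Sample at every interval strictly before the final epoch.
    for epoch in range(interval, total_epochs, interval):
        percent += per_round
        schedule.append((epoch, percent))
    return schedule


print(labeled_percent_schedule(initial=2, per_round=10, interval=20, total_epochs=100))
# [(0, 2), (20, 12), (40, 22), (60, 32), (80, 42)]
```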

Experiment method 1 to speed up sampling operations

How to experiment

  1. Randomly sample 20% of the unlabeled data. Let's call this the candidate set.
  2. Get the model inference results for all the data in the candidate set, and select the data with the highest entropy values (following the experimental design, the amount selected is 10% of the total data).
  3. Retrain the model after adding the selected data to the already labeled dataset.
  4. Repeat the above process every 20 epochs for a total of 100 epochs.
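
The selection step of this procedure can be sketched as follows. It is a minimal illustration, not the experiment code: `entropy_of` is a hypothetical callable that returns the model's entropy for one item, and the function and parameter names are ours.

```python
import random
from typing import Callable, Hashable, List


def select_top_entropy_from_subset(
    unlabeled: List[Hashable],
    entropy_of: Callable[[Hashable], float],  # hypothetical: model entropy per item
    subset_fraction: float,                   # e.g. 0.2 for a 20% random subset
    k: int,                                   # items to label this round
    rng: random.Random,
) -> List[Hashable]:
    # Step 1: draw a random candidate set from the unlabeled data.
    subset_size = min(len(unlabeled), max(1, int(len(unlabeled) * subset_fraction)))
    subset = rng.sample(unlabeled, subset_size)
    # Step 2: keep the k items with the highest entropy within that set.
    subset.sort(key=entropy_of, reverse=True)
    return subset[:k]
```

Inference runs only on the candidate set, not on the whole pool, which is what keeps each sampling round fast.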

With this method, mAP did not change (0.41 → 0.41) relative to the comparison group, a model trained on randomly selected data without active learning.

Why did the algorithm, which was clearly effective in the paper, not work in this experiment? There are a number of possible reasons.

  • The initial amount of data, the amount of sampling, and the retraining interval were different from the experiments in the paper.
  • The difficulty of learning the dataset itself
  • Differences in sampling methods

We decided to tweak the third of these - the sampling method - because the first can have a significant impact on task scheduling, and the second is harder to control.

If you think about it, what are the implications of this method? Let me give you an example.

Suppose a school has 1,000 students, you arbitrarily divide them into 5 classes, and you select 100 students from one of those classes in order of their grades.

  • If you instead picked 100 students from the whole school in order of grade, you'd have a sample that is overly biased on grades.
  • If you increase the number of classes to 10, so that each class has 100 students, and then pick 100 students from one class, you end up picking every student in that class, which is no different from random sampling.

Rather than collecting only the data the model is currently struggling with, wouldn't it be more effective to train on a slightly wider range? Collecting only the hardest data biases the training set toward the decision boundary. That can lead to learning that is skewed toward certain classes, or to less generalized learning that ignores the distribution of the other data the model will need to learn later. So we proposed a different method to address this problem.

Experimental method 2 to speed up sampling operations

Experiment method

  1. Randomly sample 20% of the unlabeled data. Let's call this the candidate set.
  2. Get the model inference results for all the data in the candidate set, and select the data with the highest entropy values (following your experimental design, set the amount to 10% of the total data).
  3. Randomly sample additional data from the rest of the set.
  4. Retrain the model after adding the selected data to the already labeled dataset.
  5. Repeat the above process every 20 epochs to train up to 100 epochs.

When sampling the data, we also draw a truly random sample from the rest of the group, which creates a more generalized subset that is not overly concentrated on the decision boundary. The ratio of high-entropy data to randomly sampled data could be 50:50 or, say, 30:70. For this experiment, we didn't want to go too far out on a limb, so we went with 50:50.

The result was an improvement in mAP from 0.41 to 0.43, achieved over multiple sampling iterations despite training on the same amount of data.
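
A sketch of this mixed selection is shown below. Two points are assumptions on our part rather than details stated in the text: the random portion is drawn from the remainder of the same candidate set, and the total number of items per round stays fixed at `k` while `uncertain_ratio` splits it. `entropy_of` is a hypothetical callable returning the model's entropy for one item.

```python
import random
from typing import Callable, Hashable, List


def select_mixed_from_subset(
    unlabeled: List[Hashable],
    entropy_of: Callable[[Hashable], float],  # hypothetical: model entropy per item
    subset_fraction: float,                   # e.g. 0.2 for a 20% random subset
    k: int,                                   # total items to label this round
    uncertain_ratio: float,                   # e.g. 0.5 for a 50:50 split
    rng: random.Random,
) -> List[Hashable]:
    # Draw a random candidate set from the unlabeled data.
    subset_size = min(len(unlabeled), max(1, int(len(unlabeled) * subset_fraction)))
    subset = rng.sample(unlabeled, subset_size)
    subset.sort(key=entropy_of, reverse=True)

    # High-entropy share: the items the model is least sure about.
    n_uncertain = int(round(k * uncertain_ratio))
    uncertain = subset[:n_uncertain]

    # Random share: drawn from what is left of the candidate set (assumption).
    rest = subset[n_uncertain:]
    n_random = min(max(k - n_uncertain, 0), len(rest))
    return uncertain + rng.sample(rest, n_random)
```

Setting `uncertain_ratio=1.0` falls back to pure top-entropy selection from the candidate set, and `uncertain_ratio=0.0` to pure random sampling, so the ratio is the only knob that changes between the two extremes.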

Graph of mAP over epochs - results with active learning (blue) and normal learning (red), with test set performance over the entire training process on the left and a zoomed-in view of the back end on the right.


In this article, we described Active Learning and showed how we applied it to speed up sampling operations in DataHunt.

The model that delivers pre-labels exists to make our workers more efficient, so we did not train it without a strategy. Our approach is not very complicated, but through our own strategy and experimentation we were able to build a better model.
