What is Object Detection?
Object detection is the task of localizing objects in an image and assigning attributes to them (classification). Many projects at DataHunt need not only an object's general attributes but also its location in the image. Depending on the situation, the attribute may not be a single one such as "animal type" but several, such as "animal type," "color," and "pose."
For example, a label for the object in a photo may say not only that it is a "dog," but also that its position is "center," its pose is "sitting," and its color is "white."
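As a rough illustration, a label like this could be stored as a record that holds a bounding box plus a free-form set of attributes. The sketch below is only an assumption about how such a record might look: the class name `ObjectLabel`, its fields, and the pixel values are hypothetical and are not DataHunt's actual label format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ObjectLabel:
    """Hypothetical label record: one box plus any number of attributes."""
    # Bounding box in pixels: top-left corner (x, y), then width and height.
    x: float
    y: float
    w: float
    h: float
    # Attribute name -> value, e.g. "animal type" -> "dog".
    attributes: Dict[str, str] = field(default_factory=dict)


# Illustrative values only; the coordinates are made up for this example.
label = ObjectLabel(
    x=120, y=80, w=200, h=260,
    attributes={
        "animal type": "dog",
        "position": "center",
        "pose": "sitting",
        "color": "white",
    },
)

print(label.attributes["pose"])
```

Keeping the attributes in a dictionary means a project can add or drop attribute types without changing the record itself.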
In practice, data labeling captures the sitting puppy shown in the photo above and assigns these attributes to it. Labeling for object detection tasks is a necessary step for introducing AI in fields such as fashion, construction, and healthcare, not just autonomous driving, which is receiving a lot of attention these days.
Why YOLO as a topic?
One of the biggest advantages of DataHunt is that it utilizes AI models to help workers become more efficient when processing data.
We have to think carefully about which AI models we use for this. A model that delivers labels quickly but inaccurately produces labels that must be corrected later, while a model that is as accurate as possible but takes too long to deliver them does not make workers more efficient either.
So, among the many object detection models, we needed to find one that balanced speed and accuracy, which led us to YOLO.
How does YOLO compare to other models in terms of both speed and accuracy? YOLO improved considerably from v1 to v5 and is still being improved. In this article, we'll review the core concepts that run through YOLO v1 and the later versions, how they differ from other models, and their pros and cons.
Models in the R-CNN family, which are categorized as two-stage detectors, first generate a set of candidate regions across many areas of the image and then predict objectness, object region, and class for each candidate.
Because the second stage runs a relatively heavy computation on each candidate in turn, these models tend to be more accurate. The downside is that they are also slower.
To address this speed issue, YOLO proposed a one-stage detector structure. Because the feature map produced by the backbone acts like a grid laid over the original image, the rest of the work, including deciding whether an object is present in each grid cell, can be performed without generating candidate regions.
You can think of this as being possible because the convolution operations throughout the network preserve spatial layout. Strictly speaking, convolution is translation-equivariant rather than invariant: shifting the input shifts the output by the corresponding amount.
In other words, the relative positions of the pixels in the image don't change as the computation proceeds. If Jisu and Jae-yoon are sitting next to me, then no matter how many times we change offices, Jisu and Jae-yoon will still be sitting next to me.
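The property can be checked on a toy example. The sketch below is a minimal, single-channel illustration in plain Python, not how a real backbone is implemented; the helper names `correlate2d_valid` and `shift_right`, the image, and the kernel are all made up for this example. It slides a 3x3 kernel over a small image, then does the same after moving the image one pixel to the right, and compares the two feature maps.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (single channel, no bias)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def shift_right(image, fill=0):
    """Move every pixel one column to the right, filling the left edge."""
    return [[fill] + row[:-1] for row in image]


# A toy 5x6 "image" with a small blob, and an arbitrary 3x3 kernel.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 3, 4, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
kernel = [
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1],
]

features = correlate2d_valid(image, kernel)
shifted_features = correlate2d_valid(shift_right(image), kernel)

# Apart from the new left-most column, the shifted feature map is the
# original feature map moved one column to the right.
moved_with_input = all(
    shifted_row[1:] == row[:-1]
    for shifted_row, row in zip(shifted_features, features)
)
print(moved_with_input)
```

Because each position in the feature map keeps tracking the same part of the image, each cell of the final feature map can be treated as responsible for one grid cell of the input.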
Predict at Once
As the figure above shows, YOLO uses its last layer to infer all the information at once; the figure below shows how.
YOLO defines regions of different sizes and aspect ratios (called anchors) for each grid cell in advance, and for each anchor it infers the location and size (x, y, w, h) and the objectness of the region.
Separately, it classifies the object in the region covered by each anchor. With this structure, YOLO obtains the desired information (region location, objectness, and class) in one step.
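To make the "one step" concrete, the final layer's output can be pictured as one vector per grid cell, with a slot for each anchor. The sketch below assumes a layout in which every anchor carries its own 4 box values, 1 objectness value, and one score per class; the function names are hypothetical, the ordering inside the vector varies between implementations, and the further decoding of raw values that real models perform is omitted.

```python
def output_shape(grid_size, num_anchors, num_classes):
    """Shape of the final prediction map: one vector per grid cell."""
    # Per anchor: 4 box values (x, y, w, h) + 1 objectness + class scores.
    per_anchor = 4 + 1 + num_classes
    return (grid_size, grid_size, num_anchors * per_anchor)


def split_cell_vector(cell_vector, num_anchors, num_classes):
    """Split one grid cell's flat vector into per-anchor predictions."""
    per_anchor = 4 + 1 + num_classes
    if len(cell_vector) != num_anchors * per_anchor:
        raise ValueError("vector length does not match anchors and classes")
    predictions = []
    for a in range(num_anchors):
        chunk = cell_vector[a * per_anchor:(a + 1) * per_anchor]
        predictions.append({
            "box": tuple(chunk[0:4]),        # (x, y, w, h)
            "objectness": chunk[4],
            "class_scores": list(chunk[5:]),
        })
    return predictions


# Assumed example values: 13x13 grid, 5 anchors, 20 classes.
print(output_shape(13, 5, 20))

# One cell with 2 anchors and 3 classes holds 2 * (4 + 1 + 3) = 16 numbers.
cell = [float(i) for i in range(16)]
for pred in split_cell_vector(cell, num_anchors=2, num_classes=3):
    print(pred)
```

Under these assumptions, a single forward pass fills the whole grid of vectors, so location, objectness, and class for every anchor come out together.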
While the two-stage structure runs this computation separately for each candidate region, in YOLO the number of results to infer changes from the number of regions to the number of grid cells. Since candidate regions usually number in the thousands, while grid cells number in the tens to hundreds, the amount of computation can be reduced significantly.
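A quick back-of-the-envelope calculation shows the scale of the difference. The numbers below are assumptions chosen for illustration (2,000 candidate regions, a 416-pixel square input, and a backbone that downsamples by a factor of 32), and `grid_cells` is a hypothetical helper. Counting units is only a crude proxy for cost, since it ignores how much work each unit takes.

```python
def grid_cells(input_size, stride):
    """Number of grid cells when a square input is downsampled by `stride`."""
    cells_per_side = input_size // stride
    return cells_per_side * cells_per_side


# Assumed numbers for illustration only.
num_region_proposals = 2000                        # candidate regions, in the thousands
num_cells = grid_cells(input_size=416, stride=32)  # 13 x 13 = 169 grid cells

print(num_cells)
print(num_region_proposals / num_cells)  # how many times fewer units to process
```

With these assumed values the detector handles roughly an order of magnitude fewer units, and it handles all of them in a single pass.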
Advantages and disadvantages of the YOLO model
That was a brief overview of the YOLO model. Since these ideas carry over to the rest of the series without much change, it is worth summarizing the features, advantages, and disadvantages of the YOLO model here.
- Fast compared to two-stage detectors.
- Low false-positive rate on background.
- Poor detection performance for small objects.
- Performance on overlapping objects is somewhat poor.
In the end, YOLO is a model that achieves faster inference at the cost of slightly lower accuracy. For many applications, however, the speed gain is enough to offset that slight decrease. YOLO is already used in various real-world services, and a number of architectures have been proposed and published in the YOLO series to address the shortcomings above, which we will cover in a later post.