In April 2023, Meta AI released DINOv2, a new method for training high-performance computer vision models using self-supervised learning.
Self-supervised learning, which is also used to train large language models (LLMs), has a major advantage: it does not require large amounts of labeled data. In this post, we will review why self-supervised learning emerged, how it is structured, what its advantages are, and the results obtained with DINOv2.
Paper: https://arxiv.org/pdf/2304.07193.pdf
Code: https://github.com/facebookresearch/dinov2
Demo: https://dinov2.metademolab.com
The need for Self-Supervised Learning
Before we get into the paper, let's briefly cover what Self-supervised Learning (SSL) is and why we need it.
What is Self-supervised Learning (SSL)?
SSL is a training method in which the supervisory signal comes from the data itself, so the model learns to understand its input without human-provided labels.

In the image above, Supervised Learning (SL) on the left uses correct answers provided by humans to train the model; hence the name: the model learns under human supervision.
SSL, on the right, is called self-supervised because the data itself holds the clues for learning. For example, when part of an image is occluded, the model learns to restore the occluded region. This makes SSL usable in almost any setting, since no human labeling is required. Of course, filling in blanks isn't the only approach; there are many different SSL objectives.
Why is SSL needed?
Familiar techniques like object detection and segmentation are concrete applications of a model, called downstream tasks.
Models used in industry need to be good at their respective downstream tasks, and SSL matters here because pre-training with SSL and then fine-tuning for the downstream task typically yields better performance than training on the downstream task alone from scratch.

There are many different ways to do SSL, but experiments support the underlying assumption: the process leads the model to a richer, more general understanding of the image.
It seems like a no-brainer: it's simple, requires no labels, and in most cases produces better models.
The paper we're reviewing today, DINOv2, is of interest because it claims to get better results from SSL than traditional methods.
Introduction of DINOv2
Big names like Meta, Google, and OpenAI seem to be working on foundation models these days. In NLP, GPT has been dominant and much loved by the public, but what are the models in Computer Vision?
There have been studies like Radford et al. (CLIP), but they all adopted text-guided pre-training, which has two major problems:
- Text captions only describe an image's most salient content, so pixel-level information is hard to learn from text supervision alone.
- The image encoder always requires aligned text-image pairs, so images cannot be learned from on their own.
So have there been studies that tried to learn without text? Of course, but those studies had one of two problems:
- the dataset was too small (ImageNet-1k), or
- the dataset quality was very poor (uncurated).
DINOv2 addresses all of these issues with:
Image / patch-level discriminative self-supervised learning (SSL), like iBOT
- Image level: the task of distinguishing different images from one another.
- Patch level: the task of distinguishing different patches within an image.
- In short, it is iBOT with several additions.
Large, cleaned datasets
- A dataset of 142M images
Fast learning techniques with low memory usage
- 2x faster and uses 3x less memory
Let's briefly look at what patch-level learning produces:

As the figure above shows, a patch-level PCA of the features clearly separates the parts of an object: for an eagle, the wings, body, and head appear in different colors. This is the basis for saying that the model semantically understands the object.
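If you want to try this kind of visualization yourself, below is a minimal sketch, assuming the torch.hub entry point and forward_features API of the official repo (these may differ across versions); "eagle.jpg" is a placeholder path:

```python
# Minimal sketch: PCA visualization of DINOv2 patch features.
# Assumes the torch.hub entry points of facebookresearch/dinov2;
# API details may differ between releases.
import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# 518 = 37 * 14, a multiple of the ViT-S/14 patch size
preprocess = transforms.Compose([
    transforms.Resize((518, 518)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("eagle.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    # (1, 37*37, 384) patch tokens
    feats = model.forward_features(img)["x_norm_patchtokens"]

# Project the 384-d patch features onto their top 3 principal components
# and render them as RGB: parts of the object get distinct colors.
patches = feats.squeeze(0) - feats.squeeze(0).mean(0)
_, _, v = torch.pca_lowrank(patches, q=3, center=False)
rgb = patches @ v                                  # (1369, 3)
rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values)
rgb = rgb.reshape(37, 37, 3)                       # one color per patch
```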
So let's take a look at what we did in DINOv2.
Methods of DINOv2
As mentioned above, a foundation model needs a dataset and a model to match.
Data Processing

If you look at Table 15 in the paper's appendix, you'll see that a total of 142M images were assembled from various datasets, which the paper calls the LVD-142M dataset. Here's how it was created:
- Collecting raw images: Images are gathered from the internet, 1.2B in total, good and bad alike. These form the uncurated data.
- Deduplication: A copy-detection pipeline removes duplicate images, dropping everything that is an exact or near duplicate.
- Self-supervised image retrieval: The uncurated data collected from the internet is then curated, as follows:
(a). Embed both the curated and uncurated images with a ViT-H/16 model.
(b). Run k-means clustering on the uncurated data.
(c). For each curated image, pick the \(N\) closest uncurated images, or find the closest cluster and pick \(M\) images within it.
(d). Tune \(N\) and \(M\) by visually inspecting the selected images and adjusting based on their quality.
In this way, 142M images were obtained. The curated data is a collection of known datasets, as mentioned earlier, and depending on the source dataset the authors chose between as is (keep the curated source unchanged), sample (retrieve the closest images), and cluster (sample from the closest cluster); the choices are listed in Table 15.
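For illustration, here is a rough sketch of steps (b)-(d) using FAISS; this is not the authors' pipeline, and the embedding files, dimension, and cluster count below are placeholders:

```python
# Sketch of the self-supervised retrieval step, assuming pre-computed
# ViT-H/16 embeddings saved to disk (file names are placeholders).
import faiss
import numpy as np

d = 1280                                                    # embedding dim (illustrative)
curated = np.load("curated_emb.npy").astype("float32")      # (Nc, d)
uncurated = np.load("uncurated_emb.npy").astype("float32")  # (Nu, d)
faiss.normalize_L2(curated)
faiss.normalize_L2(uncurated)

# (b) k-means clustering over the uncurated pool
kmeans = faiss.Kmeans(d, 100_000, niter=10)   # cluster count illustrative
kmeans.train(uncurated)

# (c) Option 1: retrieve the N nearest uncurated images per curated image.
# Inner product equals cosine similarity after L2 normalization.
index = faiss.IndexFlatIP(d)
index.add(uncurated)
N = 4                                         # (d) N, M tuned by eye
_, nearest_ids = index.search(curated, N)

# (c) Option 2: assign each curated image to its closest cluster,
# then sample M uncurated images from within that cluster.
_, cluster_ids = kmeans.index.search(curated, 1)
```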
Discriminative Self-supervised Pre-training
The DINOv2 model itself combines many different techniques to improve performance; it's like a herbal remedy that mixes in every ingredient said to be good for you, and it reminds me of the YOLO series in that respect. Before explaining DINOv2, I'm attaching the DINO structure.

Image-level objective
Utilizes the knowledge distillation method: a cross-entropy loss aligns the output of the Student model (the one actually being trained) with that of a Teacher model. In DINO, the Teacher is not a separately trained larger network but an exponential moving average (EMA) of the Student's weights.
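As a sketch (not the official implementation), the image-level loss could look like this in PyTorch, assuming both networks output logits over K prototype dimensions:

```python
# Minimal sketch of the image-level DINO objective: cross-entropy between
# the teacher's sharpened, centered distribution and the student's output.
import torch
import torch.nn.functional as F

def dino_image_loss(student_logits, teacher_logits, center,
                    t_student=0.1, t_teacher=0.04):
    with torch.no_grad():  # no gradient flows through the teacher
        t = F.softmax((teacher_logits - center) / t_teacher, dim=-1)
    s = F.log_softmax(student_logits / t_student, dim=-1)
    return -(t * s).sum(dim=-1).mean()

@torch.no_grad()
def update_teacher(student, teacher, m=0.996):
    # The teacher is an exponential moving average of the student.
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)
```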
Patch-level objective
The image fed to the Student has some patches masked, while the image fed to the Teacher is unmasked, and the cross-entropy between their outputs is computed patch-wise. The following illustration should help:

In the illustration, only one masked patch is shown for clarity, but in practice many more patches are masked.
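Under the same assumptions as the image-level sketch above, the patch-level loss differs only in being computed per patch and averaged over the masked positions:

```python
# Minimal sketch of the patch-level (iBOT-style) objective.
import torch
import torch.nn.functional as F

def patch_level_loss(student_patch_logits,   # (B, P, K) from the masked input
                     teacher_patch_logits,   # (B, P, K) from the unmasked input
                     mask,                   # (B, P) bool, True where masked
                     t_student=0.1, t_teacher=0.04):
    with torch.no_grad():
        t = F.softmax(teacher_patch_logits / t_teacher, dim=-1)
    s = F.log_softmax(student_patch_logits / t_student, dim=-1)
    per_patch = -(t * s).sum(dim=-1)         # (B, P) cross-entropy per patch
    return per_patch[mask].mean()            # average over masked patches only
```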
Untying head weights between both objectives
For those unfamiliar with weight tying: since cross-entropy is computed for both the image-level and patch-level objectives above, each needs a head of several layers to produce the tensors for the loss, and in iBOT that head was shared between the two objectives. DINOv2 unties it, giving each objective its own head, which the authors found to work better at their scale.
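Untying then just means instantiating two heads instead of sharing one; a sketch with illustrative dimensions:

```python
# Sketch: separate projection heads for the two objectives
# (layer sizes and prototype count are illustrative).
import torch.nn as nn

def make_head(in_dim=384, hidden=2048, out_dim=65536):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.GELU(),
        nn.Linear(hidden, hidden), nn.GELU(),
        nn.Linear(hidden, out_dim),
    )

image_head = make_head()   # feeds the image-level loss
patch_head = make_head()   # separate parameters for the patch-level loss
```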
Sinkhorn-Knopp centering
DINOv2 adopts the Sinkhorn-Knopp batch normalization proposed in the SwAV paper, in place of the Teacher's softmax-centering in DINO. Intuitively, it alternately normalizes the teacher's score matrix over samples and prototypes so that the batch is spread evenly across prototypes, which helps prevent collapse and narrows the Teacher-Student gap.
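For intuition, here is a sketch of the iteration in the style of the SwAV/DINO reference code:

```python
# Sketch of Sinkhorn-Knopp centering: alternately normalize the teacher's
# score matrix over prototypes (rows) and samples (columns) so that the
# batch is spread evenly across prototypes.
import torch

@torch.no_grad()
def sinkhorn_knopp(teacher_logits, n_iters=3, temp=0.04):
    Q = torch.exp(teacher_logits / temp).T   # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)      # each prototype gets equal mass
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)      # each sample sums to one
        Q /= B
    return (Q * B).T                         # (B, K), rows sum to 1
```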
KoLeo regularizer
The idea is that, given multiple features in a batch, the KoLeo regularizer (based on the Kozachenko-Leonenko differential entropy estimator) encourages the distances between them to be as uniform as possible, without changing their relative positions.
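A sketch of what the regularizer computes, assuming L2-normalized features (for unit vectors, the squared distance is \(2 - 2\cos\)):

```python
# Sketch of the KoLeo regularizer: penalize small nearest-neighbor
# distances within the batch so that features spread out uniformly.
import torch
import torch.nn.functional as F

def koleo_loss(x, eps=1e-8):
    x = F.normalize(x, dim=-1)                # (B, D) unit-norm features
    sims = x @ x.T                            # cosine similarities
    sims.fill_diagonal_(-2.0)                 # exclude self-matches
    nn_sim = sims.max(dim=1).values           # nearest neighbor per sample
    nn_dist = torch.sqrt(torch.clamp(2 - 2 * nn_sim, min=0))
    return -torch.log(nn_dist + eps).mean()
```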
Adapting the resolution
Many existing image studies used \(224\times224\) inputs, but if the original image is large, resizing it down that far makes small objects disappear. So this paper also trains at \(518\times518\); since that resolution costs a lot of memory and time, it is not used from the beginning but only briefly at the end of pre-training.
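One practical detail: changing a ViT's input resolution means its positional embeddings must be resized. The paper does not spell this step out, but a common approach looks like this sketch:

```python
# Sketch: bicubic interpolation of ViT positional embeddings so a model
# pre-trained at a lower resolution can take 518x518 inputs.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_grid):
    # pos_embed: (1, 1 + H*W, D), with a leading class-token embedding
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    d = patch_pos.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid ** 2, d)
    return torch.cat([cls_tok, patch_pos], dim=1)
```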
After applying all of these methods, they saw a significant increase in performance. There's a table in the paper that summarizes this.

Efficient Implementation
Outside of the model structure, there are a number of training-efficiency techniques. This post would be too long if it covered them all, so I'll keep it very brief and update this post with more details as they become available.
- FlashAttention: a self-attention implementation with improved memory usage and speed.
- Nested tensors in self-attention: a method that forward-passes global crops and local crops in the same batch even though they have different patch counts, which speeds up training.
- Efficient Stochastic Depth: stochastic depth, adopted by many networks, randomly drops some residual branches during training. Here about 40% of the residual branches are dropped, and instead of computing a branch and masking its output, the computation for dropped samples is skipped entirely (see the sketch after this list).
- Fully-Sharded Data Parallel (FSDP): if you know PyTorch's DDP, this is easy to place. With DP/DDP each replica is bounded by a single GPU's memory, whereas FSDP shards the model, gradients, and optimizer state across GPUs, so the bound becomes the total memory across all GPUs in the compute nodes.
- Model Distillation: as discussed above, the largest version (ViT-g) is trained first, and the smaller models are then trained with it as the teacher network, so that even the small models perform well.
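Here is the efficient stochastic depth sketch referenced above; the idea, as described in the paper, is to skip the computation for dropped samples entirely rather than computing it and masking it out. This is my reading, not the authors' code:

```python
# Sketch: run a residual branch only on the kept subset of the batch,
# then scatter the result back, instead of masking after computing.
import torch

def residual_with_stochastic_depth(x, branch, drop_rate=0.4, training=True):
    if not training or drop_rate == 0.0:
        return x + branch(x)
    keep = torch.rand(x.shape[0], device=x.device) > drop_rate
    if keep.any():
        out = branch(x[keep]) / (1 - drop_rate)  # rescale the kept samples
        x = x.clone()                            # avoid in-place on the input
        x[keep] = x[keep] + out
    return x
```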
Results
After all, the power of data is great, and all the good work you do pays off.

Self-supervised learning alone is shown to outperform weakly-supervised methods at the same model size, without borrowing the power of text. Personally, I don't think DINOv2 introduces any revolutionary new method, but I really applaud the effort to improve in every area, including model size and speed.
In addition to the above tasks, it has also shown results in video classification, instance recognition, segmentation, and depth estimation.

Live Demo
The demo site published by Meta AI provides demos for Depth Estimation, Semantic Segmentation, Instance Retrieval, and more.
Semantic segmentation doesn't feel very different from what you may have already experienced with the Segment Anything Model, but depth estimation and instance retrieval perform so well that the demo is fun to play with.



Conclusion
Today we've taken a light look at DINOv2, the latest work in SSL, which helps models build a good understanding of the data itself.
This study is also very valuable for DataHunt, for two reasons:
- There is often too little labeled data to get good supervised learning results.
- In many of our workflows, all of the data arrives first and labeling happens afterward.
The second point is where SSL shines: because it learns from the large amounts of unlabeled data we start with, we can give our annotators better pre-labeling than traditional methods allow.
Shortly after releasing the Segment Anything Model (SAM), Meta AI released this second version of DINO. As interest in LLMs has grown, there have been many AI chatbot projects and adoption inquiries, and now foundation models with performance comparable to LLMs are emerging in computer vision as well. As more models can derive the desired results from large-scale unlabeled data, the way is opening for computer vision across many industries.