Recently, LLMs have been delivering very impressive performance with zero-shot and few-shot learning.
The Segment Anything Model (hereinafter SAM), recently announced by Meta, aims to be a foundation model for image segmentation, tackling many kinds of segmentation problems through prompt engineering.
I wish I had developed it myself, but let's take a closer look at the paper to learn quickly and improve the data quality of DataHunt.
Paper link: Segment Anything
Github link: Segment Anything Github
Segment Anything Introduction

Foundation model is the collective term for giant models that are pre-trained on huge datasets, regardless of the domain. These models show tremendous generalizability to downstream tasks. In short, we can say that they are models with massive understanding.
The purpose of this paper is to develop a foundation model for image segmentation. It is zero-shot, meaning that it can segment objects the model has never been trained on.
As with our recent series, there are some prerequisites: Task, Model, and Data. The authors of the paper ask three questions in this regard:
- What task will enable zero-shot generalization?
- What is the corresponding model architecture?
- What data can power this task and model?
Segment Anything Task
We call ChatGPT a prompt-driven model in the sense that when a user asks for something, it produces output accordingly. But can we do the same with segmentation?
There are many different types of prompts, and the authors designed SAM to accept points, boxes, and text as input. In fact, segmentation by pointing is not a new approach; work like Deep Extreme Cut (DEXTR), RITM, and FocalClick has been done under the name Interactive Segmentation.
Segment Anything Model
The model can be broken down into three areas: the Image encoder, the Prompt encoder, and the Mask decoder. For convenience, we will group the Prompt encoder and Mask decoder together and call them the Prompt model. The Image encoder is a ViT (Vision Transformer) based model for processing high-resolution images, so its inference takes relatively long. In our tests, the vit_l model took 0.669s (RTX 3090) and 8.257s (CPU) for a single 1080 x 1920 image. The Prompt model, on the other hand, has a processing time low enough to enable real-time inference. It also has the property that the same image embedding can be reused across multiple promptable tasks (point, bounding box, mask, text).
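Concretely, with the released code this workflow looks roughly like the following (a minimal sketch assuming the public segment-anything package and a downloaded checkpoint; file names and paths here are just examples): the expensive encoder pass runs once in set_image, and every subsequent prompt reuses the cached embedding.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Example checkpoint path; vit_l is the variant we timed above.
sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")
sam.to("cuda")  # the heavy ViT encoder benefits a lot from a GPU
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the image encoder once (the slow part)

# The cached embedding is reused for every prompt below (the fast part).
masks_pt, scores_pt, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),          # 1 = foreground click
)
masks_box, scores_box, _ = predictor.predict(
    box=np.array([100, 100, 600, 500]),  # box prompt in xyxy format
)
```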
A characteristic of these behemoths is that they all have a powerful encoder, and everything after that great encoder is relatively lightweight. The model introduced in this paper is the Segment Anything Model (SAM), and it consists of three main parts:
- Powerful Image Encoder
- Prompt Encoder
- Mask Decoder

Compared to the Image encoder, the Prompt encoder and Mask decoder are so lightweight that they're claimed to run in less than 50ms on the web! Let's take a look at how they do it.
Segment Anything Data
Before we talk about the dataset, we need to mention the Data Engine. The text used to train GPT is all over the internet, but the masks needed for segmentation are not easy to find. So the authors built a Data Engine, which consists of 3 stages.
- Assisted-manual: The annotator clicks a point and SAM produces a mask to some extent, which the annotator then refines. This is assisted labeling.
- Semi-automatic: SAM creates masks on its own for a confident subset of objects, while a human annotates the remaining objects at the same time. For example, you could tell SAM "generate masks for all apples in an image", and while it is making the apple masks, a human works on the remaining objects.
- Fully automatic: SAM takes a grid of points over the image and generates masks for everything entirely on its own.
In particular, this is what Fully automatic looks like.

Fully automatic needs a little tweaking, but the results are amazing!
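With the released code, the fully automatic stage can be reproduced roughly like this (a sketch assuming the public segment-anything package and a downloaded vit_h checkpoint; the threshold values shown are, as far as I know, the defaults):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Example checkpoint path for the largest (vit_h) model.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")

mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # the regular grid of point prompts
    pred_iou_thresh=0.88,         # drop low-confidence masks
    stability_score_thresh=0.95,  # drop unstable masks
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: "segmentation", "area", "bbox", ...
print(len(masks), "masks generated")
```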
This is how they were able to collect so much data: 1.1B masks from 11M images.
Segment Anything Method
We've explained that this model consists of three components. Let's go through them one by one.
Image Encoder
The authors use a ViT pre-trained as a Masked Autoencoder (MAE). I don't think I need to say much more about this model because it has already proven itself. When an image comes in, it goes through the Image encoder to produce an embedding; whatever you do with it afterward, that's where the Image encoder's job ends.
Prompt Encoder
There are two kinds of prompts; let's take a look at each and how it is encoded.
Sparse prompt
- Points: positional encodings summed with learned embeddings.
- Boxes: positional encodings summed with learned embeddings.
- Text: the output of CLIP's text encoder.
Dense prompt
- Mask: embedded with convolutions and summed element-wise with the image embedding.
The code is quite long, so I've only excerpted part of it; the source is here.
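To make the idea concrete, here is a toy sketch I wrote for illustration (not the actual SAM source; all class and variable names are made up): sparse prompts become positional encodings plus learned per-type embeddings, and a mask prompt is downscaled with a small conv stack to match the image embedding resolution.

```python
import math
import torch
import torch.nn as nn

class TinyPromptEncoder(nn.Module):
    """Simplified sketch of the prompt-encoding idea (not the SAM source)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # one learned embedding per point label (e.g. foreground / background)
        self.point_embeddings = nn.Embedding(2, embed_dim)
        # random Fourier features as a stand-in for SAM's positional encoding
        self.register_buffer("pe_matrix", torch.randn(2, embed_dim // 2))
        # conv stack mapping a low-res mask down to the image-embedding resolution
        self.mask_downscale = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(4, embed_dim, kernel_size=2, stride=2),
        )

    def positional_encoding(self, coords):  # coords normalized to [0, 1], shape (N, 2)
        proj = 2 * math.pi * coords @ self.pe_matrix
        return torch.cat([proj.sin(), proj.cos()], dim=-1)

    def forward(self, point_coords, point_labels, mask):
        # sparse prompt: positional encoding summed with a learned type embedding
        sparse = self.positional_encoding(point_coords) + self.point_embeddings(point_labels)
        # dense prompt: downscaled mask, later summed with the image embedding
        dense = self.mask_downscale(mask)   # (1, embed_dim, H/4, W/4)
        return sparse, dense
```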
Mask Decoder
This is the part that takes the image embedding and the prompt embedding and predicts the mask. Basically, it uses prompt self-attention and cross-attention in the Transformer decoder block in a bidirectional way (by bidirectional I mean prompt-to-image and image-to-prompt). The reason it's bidirectional is that both the image embedding and the prompt embedding need to be updated.
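To make "bidirectional" concrete, here is a toy block in the same spirit (a simplification I wrote for illustration, not the SAM source): the prompt tokens attend to themselves, then to the image, and finally the image attends back to the tokens, so both embeddings get updated.

```python
import torch
import torch.nn as nn

class TwoWayBlock(nn.Module):
    """Simplified sketch of the bidirectional attention idea (not the SAM source)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.token_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_token = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, tokens, image):  # tokens: (B, N, C), image: (B, HW, C)
        # prompt self-attention
        t = self.norms[0](tokens + self.self_attn(tokens, tokens, tokens)[0])
        # prompt-to-image cross-attention
        t = self.norms[1](t + self.token_to_image(t, image, image)[0])
        t = self.norms[2](t + self.mlp(t))
        # image-to-prompt cross-attention, so the image embedding is updated too
        img = self.norms[3](image + self.image_to_token(image, t, t)[0])
        return t, img
```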
Keep in mind that this model is interactive with the Prompt.
It then performs upsampling to fit the image size and determines for each pixel whether it should be included in the mask.
Importantly, it doesn't generate labels! It just creates a mask.
Resolving Ambiguity
But if we give a single point, is there only one mask it could refer to? What I mean is that when we click a pixel on a person's fingernail, it's hard to know whether we want exactly that fingernail, the hand, or a mask of the whole person. Normally, when you get multiple mask candidates, you would average the per-pixel confidences into a single mask, which is noisy and does not give you the correct mask of the person.
So! The paper suggests outputting 3 mask candidates (3 turns out to be a good number) and backpropagating only on the one with the minimum loss.
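On the inference side, this shows up as a flag in the released predictor. Reusing the `predictor` from the earlier sketch (so this snippet assumes that setup), you get the three candidates plus the model's own quality scores:

```python
import numpy as np

# Ask for the three candidate masks and keep the one with the highest predicted score.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,   # return 3 candidates (roughly whole / part / subpart)
)
best = masks[np.argmax(scores)]  # pick by the model's own IoU prediction
```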

Losses and Training
Above, we mentioned learning from the mask with the minimum loss. In this paper, the loss is a linear combination of focal loss and dice loss (see Appendix A.2); a small sketch follows the list below.
- Focal loss down-weights easy examples so that training focuses on the more difficult objects.
- Dice loss focuses a bit more on recall than the familiar IoU: the Dice coefficient, 2|A∩B| / (|A| + |B|), counts the overlap with the GT twice, so it rewards covering the GT more than IoU does.
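Here is a minimal sketch of that loss (not the authors' training code): the alpha and gamma values are common focal-loss defaults that I'm assuming, and the 20:1 focal-to-dice weighting follows the appendix. The last function reflects the minimum-loss trick from the ambiguity section.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights easy pixels."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    """1 - Dice coefficient, computed on the predicted probabilities."""
    p = torch.sigmoid(logits).flatten(1)
    t = targets.flatten(1)
    inter = (p * t).sum(-1)
    dice = (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)
    return (1 - dice).mean()

def mask_loss(logits, targets, focal_weight=20.0, dice_weight=1.0):
    # Linear combination of the two losses (20:1 weighting as described in the appendix).
    return focal_weight * focal_loss(logits, targets) + dice_weight * dice_loss(logits, targets)

def ambiguity_loss(candidate_logits, targets):
    """With 3 candidate masks, only the best candidate's loss is backpropagated."""
    losses = torch.stack([mask_loss(l, targets) for l in candidate_logits])
    return losses.min()
```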
SA-1B Dataset
Another important point in this paper is that in addition to task and model, we have data.
- Images: The authors collected 11M high-resolution images taken by photographers, with an average resolution of a whopping 3300x4950.
- Masks: 1.1B masks were obtained from the above images, 99.1% of which were generated by the fully automatic method. But what about the quality? The authors sampled about 500 images, had the masks corrected by experts, and found that the average IoU before and after correction was around 0.9. That sounds pretty good to me.
- Mask Quality: To evaluate mask quality, the authors randomly sampled 500 images (roughly 50k masks) and had professional annotators correct them, yielding a ground-truth mask for each auto-generated mask. Measuring the IoU between each pair gave 85-91% IoU, so the automatically generated masks look quite good (a tiny IoU sketch follows below).
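For reference, the mask-to-mask IoU being measured here is just the standard intersection over union of two boolean masks; a tiny sketch:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0
```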
Experimental Results
The results are awesome. Of course, if you want to use this model for data labeling, that's another story, but the fact that it's zero-shot makes it incredibly versatile.
Zero-Shot Single Point Valid Mask Evaluation

The authors also compare SAM to other existing interactive segmentation methods, and it shows very good overall performance against RITM, which was announced by Samsung. In figure (a), the GTEA dataset at the bottom is the Georgia Tech Egocentric Activity dataset, which captures daily activities from a first-person perspective. SAM performs particularly poorly on it, so I took a closer look but didn't find an obvious reason why.
Zero-Shot Instance Segmentation
Comparing the performance of ViTDet and SAM on the COCO and LVIS datasets, SAM falls short of ViTDet quantitatively but does better in human ratings.

Zero-Shot Text-to-Mask
The experiment on performing segmentation from free-form text is a kind of proof of concept to test the model's capability. As with PhraseClick, the text prompt can be combined with click information when text input alone does not produce an accurate output.

There are a few other details, but here's the short version:
- The model was trained for 3-5 days on 256 A100 GPUs
- The image encoder takes ~0.15 seconds on an NVIDIA A100 GPU
- The prompt encoder and mask decoder take ~50ms on CPU in the browser using multithreaded SIMD execution
- The image encoder has 632M parameters
- The prompt encoder and mask decoder have 4M parameters
Conclusion
The model calls itself the "Foundation Model for Image Segmentation", and I think it deserves the title. Plus, once the image embedding is computed, I found it very useful to be able to create masks quickly and freely with prompts. I hope to try it out soon and let you know what I think.