Hello, today I'm reviewing Track Anything, which applies Segment Anything to video.
It's not a full paper, and it feels more like applying an existing model to build a service than new research, but the results are quite good, so I decided to review it.
arXiv 2023. [Paper] [Github]
Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, Feng Zheng
24 Apr 2023
Features of Segment Anything (SAM)
What are the characteristics of the Segment Anything Model, an emerging powerhouse in image segmentation?
- Class-agnostic: Does not distinguish between classes
- Interactive: Receives prompts from the user (point, box, text, mask)
- Powerful: It's just plain powerful
I think the standout among these is Interactive: if SAM were merely accurate (Powerful), it wouldn't add anything new and we'd be rehashing the same old applications. Interactive, however, opens up a whole new set of applications, because it allows for interactive segmentation in video, which we'll discuss next.
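To get a sense of what this interactivity looks like in practice, here is a minimal sketch of a single point prompt using the official segment-anything package. The checkpoint path, image file, and click coordinates are placeholder values I chose for illustration.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load SAM (ViT-H backbone) from a locally downloaded checkpoint (placeholder path).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Embed the image once, then prompt it interactively as many times as needed.
image = cv2.cvtColor(cv2.imread("frame_000.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single positive click (label 1) on the target object.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),   # (x, y) click, placeholder coordinates
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    multimask_output=True,                 # return several candidate masks with scores
)
best_mask = masks[np.argmax(scores)]
```

The same `predict` call also accepts box and mask prompts, which is exactly what TAM exploits later in its refinement step.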
However, using SAM directly on video does not deliver impressive performance due to the lack of temporal correspondence.
On the other hand, tracking or segmentation of videos faces challenges such as size variations, object deformation, motion blur, camera movement, and similar objects. Even state-of-the-art models struggle with complex scenarios in public datasets, not to mention real-world applications.
So in this paper, the authors utilize these advantages to perform Video Object Tracking (VOT) and Video Object Segmentation (VOS) in an interactive way! More specifically, they present a well-crafted VOS model called SAM + [XMem].
The upshot is that
- Rather than simply applying SAM to each frame of a video, it utilizes temporal correspondences to perform true VOS.
- It has a very easy interface that solves Interactive Tracking & Segmentation in a single interaction.
- It works so well
It's called the Track Anything Model (TAM), so let's take a look at the details.
Track Anything Model (TAM)
[Figure: TAM overview]
This paper introduces the Track-Anything project, which develops an efficient toolkit for high-performance object tracking and segmentation in video.
With a user-friendly interface, Track Anything Model (TAM) can track and segment all objects in a given video with a single inference pass. In particular, TAM interactively integrates SAM, a large-scale segmentation model, and XMem, an advanced VOS model.
First, the user initializes SAM interactively, for example by clicking on an object, to define the target. XMem is then used to predict the object's mask in the next frame based on temporal and spatial correspondence. SAM is then used to provide a more precise mask. During tracking, the user can pause and correct the mask on the fly if tracking fails.
Methodology
The sequence of this study is as follows.
- Step 1: Initialization with SAM
In the first frame, the initial mask of the object of interest is obtained through SAM. The user can get a mask of the target object with a single click, or refine it with a few more clicks until the initialization is satisfactory.
- Step 2: Tracking with XMem
XMem is then used to obtain the masks of subsequent frames. Since XMem is an advanced VOS method that produces satisfactory results in simple scenarios, in most cases its predicted mask is used as-is. If the mask quality is poor, the XMem prediction and its intermediate parameters (e.g., probes and affinities) are saved and we move on to Step 3.
- Step 3: Refinement with SAM
If the quality assessment from Step 2 is unsatisfactory, SAM is used to refine the mask predicted by XMem. The probes and affinities serve as point prompts for SAM, and the mask predicted in Step 2 serves as a mask prompt. SAM then uses these prompts to generate a fine-grained segmentation mask, which is added back into XMem's temporal correspondence to refine all subsequent predictions. (So the prompts are the probes, the affinities, and the unsatisfactory mask from XMem.)
- Step 4: Correction with human participation
With the above three steps, TAM can already handle common challenges and predict segmentation masks. However, in very difficult scenarios, especially long videos, it is still hard to distinguish objects accurately, so human correction is added during inference. This yields a qualitative leap in performance with very little human effort: the user can pause the TAM process and correct the mask of the current frame with positive and negative clicks. A rough sketch of this loop is shown right after this list.
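Putting the four steps together, here is a minimal, hypothetical sketch of the per-frame loop. The `sam_init`, `xmem`, `sam_refine`, `quality_ok`, and `user_correct` names are placeholders I made up to illustrate the flow; this is not the authors' code.

```python
def track_anything(frames, sam_init, xmem, sam_refine, quality_ok, user_correct=None):
    """Hypothetical sketch of TAM's four steps; not the authors' implementation."""
    # Step 1: interactive initialization of the target mask on the first frame.
    masks = [sam_init(frames[0])]

    for frame in frames[1:]:
        # Step 2: XMem propagates the mask using temporal/spatial correspondence.
        pred_mask, probes, affinities = xmem(frame, masks[-1])

        # Step 3: if the prediction looks poor, refine it with SAM, using the
        # probes/affinities as point prompts and the XMem mask as a mask prompt.
        if not quality_ok(pred_mask):
            pred_mask = sam_refine(frame, point_prompts=(probes, affinities),
                                   mask_prompt=pred_mask)

        # Step 4: optional human correction with positive/negative clicks.
        if user_correct is not None:
            pred_mask = user_correct(frame, pred_mask)

        masks.append(pred_mask)
    return masks
```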
You get the idea, but I think you need to know the concepts of XMem, Probes, and Affinities to understand TAM. So let's see if the XMem paper has that information.
XMem
The XMem paper proposed a new approach to solve the problems of existing VOS methods:
- Recurrent methods drift from frame to frame (errors accumulate over time).
- Memory-based methods consume too much GPU memory and are slow.
So XMem tries to solve these problems by using three types of memory: Sensory, Short-term, and Long-term Memory.
If you don't know what Memory Networks are, you can find a good explanation on the KU DSBA YouTube channel.
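As a rough, hypothetical illustration of the three stores (not XMem's actual implementation): sensory memory is refreshed every frame, working (short-term) memory keeps features of every few frames, and long-term memory is built by consolidating the working memory once it fills up. All class and parameter names below are made up for illustration.

```python
from collections import deque
import numpy as np

class ThreeStoreMemory:
    """Hypothetical sketch of an XMem-style memory hierarchy (not the real code)."""
    def __init__(self, update_every=5, max_working=10, max_longterm=50):
        self.sensory = None                       # updated every frame, fast-changing
        self.working = deque(maxlen=max_working)  # short-term: features of recent frames
        self.longterm = []                        # long-term: compact consolidated features
        self.update_every = update_every
        self.max_longterm = max_longterm

    def update(self, frame_idx, feature):
        # Sensory memory only keeps the latest frame's feature.
        self.sensory = feature
        # Working memory is extended every `update_every` frames.
        if frame_idx % self.update_every == 0:
            if len(self.working) == self.working.maxlen:
                # When full, consolidate into long-term memory. XMem selects
                # "prototype" features here; a simple mean is used as a stand-in.
                self.longterm.append(np.mean(list(self.working), axis=0))
                self.longterm = self.longterm[-self.max_longterm:]
            self.working.append(feature)
```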
If XMem did VOS perfectly, we wouldn't even need to combine it with SAM, so what are its drawbacks?
Disadvantages of XMem
- It is a semi-supervised method and requires a good initial mask
- Difficult to correct in the middle of a video, especially as the video gets longer
So in TAM, the authors combine XMem with SAM.
And a brief note on the probes and affinities we wanted to understand:
- Affinities: a matrix of similarities between the Query (the current frame) and the Memory keys (previous frames); a rough sketch follows after this list.
- Probes: not mentioned in the paper, so it was hard to find out more about them.
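For the affinities, a minimal numpy sketch of such a query-to-memory similarity could look like the following. XMem actually uses a more elaborate (anisotropic L2) similarity; a plain softmax over dot-product scores is used here only as a simplified stand-in, and the shapes are assumptions.

```python
import numpy as np

def affinity(query_key, memory_keys):
    """Hypothetical stand-in for the affinity matrix: softmax-normalized similarity
    between the current frame's query key and the stored memory keys."""
    # query_key:   (C, N) features of the current frame (N query positions)
    # memory_keys: (C, M) features stored in memory (M memory positions)
    sim = memory_keys.T @ query_key           # (M, N) raw similarity scores
    sim -= sim.max(axis=0, keepdims=True)     # stabilize the softmax
    w = np.exp(sim)
    return w / w.sum(axis=0, keepdims=True)   # each query column sums to 1 over memory
```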
Experimental Results
Quantitative Results
Their attempt to obtain a mask while tracking in a video yields good results.
[Table: quantitative comparison with other methods (Initialization / Evaluation / J&F)]
The results above show that compared to other papers,
- Initialization: It's very simple because it's done with a click.
- Evaluation: Users don't have to modify the result many times during the process.
- J&F: Performance is very good compared to existing methods
If you ask why STM, AOT, and XMem don't come out ahead here, the reason is that those methods require a sophisticated initial mask, whereas TAM starts from just clicks.
Qualitative Results
It handles target deformation, scale changes, and camera movement well, demonstrating excellent tracking and segmentation with only a click initialization and a single inference pass.

Failure Cases
The failure cases fall into two categories.

- 1. Current VOS models are mostly designed for short videos and focus more on maintaining short-term memory than long-term memory. This leads to shrinking masks or a lack of refinement in longer videos, as in (a). The authors eventually try to address this with SAM's refinement in Step 3 (Refinement with SAM), but its effectiveness in real-world applications is lower than expected. Human participation/interaction in TAM can be one approach to this challenge, but too much interaction leads to low efficiency. Therefore, long-term memory preservation and temporal memory update mechanisms remain important.
- 2. When the object structure is complex, as with the bicycle wheel in (b), the ground-truth mask contains many holes. It is very difficult to obtain a finely detailed initial mask by propagating clicks, and a coarse initialization can have side effects on subsequent frames, leading to incorrect predictions. This shows that SAM still struggles with complex and precise structures.
You can check out the video on GitHub at the link below.
https://tv.kakao.com/channel/4358324/cliplink/437781131
Conclusion
At DataHunt, we plan to use these techniques to better segment objects in videos. As you know, a video is a sequence of continuous images, so even if it seems like a lot of work, you can detect objects effectively by utilizing information from the previous (past) or next (future) frames.
Therefore, we will share results comparing the same work done without the help of AI and after applying the model.
Reference