Paper review | VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Self-supervised Learning

Suho Cho

Hello, today I'm going to introduce VideoMAE, a self-supervised learning method that has performed well in the field of video understanding. In this review, I will explain how the authors applied Masked Autoencoders (MAE) to video, what makes the approach different, and what its advantages are.




In recent years, attention-based models, from the Transformer to the Vision Transformer (ViT) with its Multi-head Self-Attention (MSA), have been steadily improved and have performed well in many different fields. But what about video? The paper points out several problems that make it difficult to use ViT for video.

  • ViT requires large-scale supervised (labeled) datasets.
  • Video transformers have already been proposed, but they rely on models pre-trained on images.
  • Existing video datasets are very small compared to image datasets.


Therefore, the authors propose a self-supervised learning approach that does not rely on labeled data. Specifically, they bring the Masked Autoencoder (MAE), which has been successful on image tasks, to video. The main contributions of the paper are as follows.

  • It is the first masked video pre-training framework that uses a plain ViT backbone (but is sticking with ViT an advantage?)
  • It shows that the MAE structure is effective for self-supervised learning from videos
  • It uncovers findings about masked modeling that were hard to notice (and less important) in images and natural language
  • It learns well from as few as 3.5k videos
  • When there is a domain gap between the source and target datasets, data quality matters more than data quantity for SSVP (Self-Supervised Video Pre-training)


Proposed Method


Revisiting Image Masked Autoencoders


If you don't know the MAE paper, it's a good idea to read it before moving on.

MAE is a model that uses an asymmetric encoder-decoder structure and is trained so that the output image 𝛪^ matches the input image 𝛪. The model works as follows.

  1. Split the input image 𝛪 into non-overlapping patches of 16x16 pixels.
    If the image size is 256x256, it is divided into 16 horizontal and 16 vertical slices, resulting in a 16x16 grid of patches (256 patches).
    Or, if the image size is 320x320, it is divided into 20 horizontal and 20 vertical slices, resulting in a 20x20 grid of patches (400 patches).
  2. Mask 75% of the resulting patches, chosen at random.
  3. Feed only the remaining 25% of the patches into the Transformer encoder 𝛷enc.
  4. The decoder 𝛷dec creates an image 𝛪^ with the same dimensions as 𝛪.
    (The encoder-decoder is asymmetric because the encoder receives only 25% of the patches, while the decoder restores the whole image.)
  5. The loss function is computed only on the masked patches, so the model is trained solely to restore the masked part well.
Asymmetric encoder-decoder and loss function
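
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the MAE implementation: the helper names `num_patches`, `random_mask`, and `masked_mse` are hypothetical, patches are represented as flat lists of pixel values, and the loss is a mean squared error averaged over the masked patches only.

```python
import random


def num_patches(image_size, patch_size=16):
    """Number of non-overlapping patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def random_mask(n, mask_ratio=0.75, seed=0):
    """Return a list of n booleans; True marks a masked patch."""
    n_masked = int(n * mask_ratio)
    mask = [True] * n_masked + [False] * (n - n_masked)
    random.Random(seed).shuffle(mask)
    return mask


def masked_mse(target, prediction, mask):
    """Mean squared error averaged over the masked patches only."""
    errors = [
        sum((t - p) ** 2 for t, p in zip(t_patch, p_patch)) / len(t_patch)
        for t_patch, p_patch, is_masked in zip(target, prediction, mask)
        if is_masked
    ]
    return sum(errors) / len(errors)


n = num_patches(256)           # 16 x 16 grid of patches
mask = random_mask(n)          # 75% of the patches are masked
visible = mask.count(False)    # patches that would be fed to the encoder
```

Visible patches never contribute to `masked_mse`, which is what "designed to restore the masked part" means in step 5.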

Characteristics of Video Data

Although MAE has been successful with the approach above, applying it directly to video may not be appropriate, because of the following two characteristics of video.

Temporal redundancy and temporal correlation

  • Temporal redundancy: Because the content changes very slowly from one video frame to the next, there is a lot of redundant information. This has the following consequences.
    - It forces the model to focus on reconstructing still or slowly moving objects.
    - It makes the reconstruction too easy for the model.
  • Temporal correlation: Video can be interpreted as a temporal extension of a static scene, so too much information can be obtained from neighboring frames. As a result:
    - With frame masking (Figure 2 (b)) or random masking (Figure 2 (c)), reconstruction becomes a trivial task, which diminishes the effectiveness of SSVP.
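
A rough calculation shows why neighboring frames leak so much information. The sketch below assumes, purely for illustration, that each spatial position is masked independently in each of 8 temporal slices. It compares this with a mask that is shared across time. The function name `prob_visible_somewhere` is hypothetical.

```python
def prob_visible_somewhere(mask_ratio, num_slices, shared_across_time=False):
    """Probability that a spatial position is unmasked in at least one temporal slice."""
    if shared_across_time:
        # One mask for all slices: a position is visible everywhere or nowhere.
        return 1.0 - mask_ratio
    # Independent masks: the position is hidden only if it is masked in every slice.
    return 1.0 - mask_ratio ** num_slices


leak_random_75 = prob_visible_somewhere(0.75, 8)   # about 0.90
leak_random_90 = prob_visible_somewhere(0.90, 8)   # about 0.57
leak_shared_90 = prob_visible_somewhere(0.90, 8, shared_across_time=True)  # about 0.10
```

Under these assumptions, even at a 90% ratio more than half of the positions can be seen in some slice when masks are independent, so the model can often copy content from a nearby frame instead of learning motion.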

Video Masked Autoencoders (VideoMAE)


Now let's see how this paper adapts Masked Autoencoders to self-supervised learning on video.


Temporal Downsampling


  1. Given a video V, pick a clip of t consecutive frames at random.
  2. Subsample the clip down to T frames using a temporal stride (stride = 4 for Kinetics, 2 for Something-Something).
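
The two steps can be sketched as an index computation. This is a minimal illustration, assuming that the clip length is t = T x stride and that T = 16. The function name `sample_frame_indices` and its parameters are hypothetical and not taken from the paper's code.

```python
import random


def sample_frame_indices(num_video_frames, num_frames=16, stride=4, seed=None):
    """Pick a random clip and keep every `stride`-th frame of it."""
    clip_len = num_frames * stride  # t consecutive frames (assumed t = T * stride)
    if num_video_frames < clip_len:
        raise ValueError("video is shorter than the requested clip")
    rng = random.Random(seed)
    # randint includes both bounds, so the clip always fits inside the video.
    start = rng.randint(0, num_video_frames - clip_len)
    return [start + i * stride for i in range(num_frames)]


indices = sample_frame_indices(300, num_frames=16, stride=4, seed=0)
```

A larger stride covers a longer time span with the same number of frames, which reduces the redundancy between the frames the model sees.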


Cube Embedding


In ViT, a 16x16 patch was one token. Here, a 2x16x16 cube, which also spans the time axis, is embedded as one token. If you look at the code, you can see that the implementation uses a 3D convolution instead of a 2D convolution - VideoMAE - Github


import torch.nn as nn
# to_2tuple turns an int into an (int, int) tuple; newer timm versions also expose it as timm.layers.to_2tuple
from timm.models.layers import to_2tuple


class PatchEmbed(nn.Module):
    """ Image to Patch Embedding
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, num_frames=16, tubelet_size=2):
        super().__init__()
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)
        self.tubelet_size = int(tubelet_size)
        # tokens = (W / patch_w) * (H / patch_h) * (T / tubelet_size)
        num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0]) * (num_frames // self.tubelet_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = num_patches
        # Kernel size equals stride, so every cube is projected exactly once (no overlap)
        self.proj = nn.Conv3d(in_channels=in_chans, out_channels=embed_dim,
                              kernel_size=(self.tubelet_size, patch_size[0], patch_size[1]),
                              stride=(self.tubelet_size, patch_size[0], patch_size[1]))

    def forward(self, x, **kwargs):
        B, C, T, H, W = x.shape
        # FIXME look at relaxing size constraints
        assert H == self.img_size[0] and W == self.img_size[1], \
            f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
        # (B, C, T, H, W) -> (B, embed_dim, T', H', W') -> (B, num_patches, embed_dim)
        x = self.proj(x).flatten(2).transpose(1, 2)
        return x
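
To get a feel for the numbers, the token count computed in `PatchEmbed` can be reproduced with plain arithmetic. This is a small sketch that reuses the default arguments of the class above. The function name `num_tokens` is hypothetical, and the 90% masking ratio is only an example value.

```python
def num_tokens(img_size=224, patch_size=16, num_frames=16, tubelet_size=2):
    """Tokens produced by cube embedding: (H / 16) * (W / 16) * (T / 2) by default."""
    spatial = (img_size // patch_size) ** 2
    temporal = num_frames // tubelet_size
    return spatial * temporal


total = num_tokens()                          # 14 * 14 * 8 = 1568
values_per_cube = 2 * 16 * 16 * 3             # raw pixel values folded into one token
visible_at_90 = total - int(total * 0.9)      # tokens left after masking 90%
```

With the defaults, a 16-frame clip at 224x224 becomes 1568 tokens, and only 157 of them stay visible at a 90% masking ratio.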


Tube Masking with Extremely High Ratios

Uncurated random videos

Unlike MAE, VideoMAE applies masking at an extremely high ratio. There are two reasons for this: temporal redundancy and temporal correlation.


[Figure: masking ratio vs. accuracy for ImageMAE]

[Figure: masking ratio vs. accuracy for VideoMAE]

The first image shows masking ratio versus accuracy for ImageMAE, and the second shows the same for VideoMAE. They look similar, but if you look closely, you can see that the performance drop occurs at around 75% in the first and at around 95% in the second.

  • Temporal redundancy: Unlike an image, a video does not have to be understood from a single frame. Because the model can draw on information from many frames, reconstruction has to be made more difficult.
    For reference, the MAE paper includes a comparable graph of performance as a function of masking ratio for images.
  • Temporal correlation: Tokens at the same spatial location share the same masking status along the time axis. To picture this, imagine a neatly aligned stack of roof tiles in which pillars pass through a very high fraction (90% or 95%) of each tile's area. Each pillar is long enough to pass through every tile in the stack, so the same positions are blocked in all of them.

The masking status of each spatial position is drawn from a Bernoulli distribution.
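
Tube masking as described above can be sketched in a few lines. This is an illustration under stated assumptions, not the official implementation: the function name `tube_mask` is hypothetical, the grid is 8 temporal slices of 14x14 positions, and each spatial position gets one independent Bernoulli draw that is then shared by all slices.

```python
import random


def tube_mask(num_slices, height, width, mask_ratio=0.9, seed=0):
    """Return mask[t][y][x]; True means the cube at that position is masked."""
    rng = random.Random(seed)
    # One Bernoulli draw per spatial position ...
    spatial = [[rng.random() < mask_ratio for _ in range(width)]
               for _ in range(height)]
    # ... copied to every temporal slice, which forms "tubes" through time.
    return [[row[:] for row in spatial] for _ in range(num_slices)]


mask = tube_mask(num_slices=8, height=14, width=14)
masked_fraction = sum(v for row in mask[0] for v in row) / (14 * 14)
```

With independent draws, the masked fraction only approximates `mask_ratio`. Fixing the exact number of masked positions per frame is an alternative design that keeps the number of visible tokens constant across samples.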



Backbone: Joint Spatio-temporal Attention


After masking 90% to 95% of the tokens, only a few remain. To still reconstruct the video well, the authors apply Joint Spatio-temporal Attention in the vanilla ViT backbone (see the ViViT paper), so that all remaining tokens can attend to one another.


The only downside is that this attention method has quadratic complexity in the number of tokens. However, because almost all of the original video is already masked and only the few remaining tokens go through the encoder, it is not computationally intensive.
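
The effect of masking on the quadratic cost can be estimated with simple arithmetic. The sketch below assumes 1568 tokens per clip (16 frames at 224x224 with 2x16x16 cubes) and a 90% masking ratio, and it counts only query-key pairs in one encoder attention layer. The function name `attention_pairs` is hypothetical.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs in one self-attention layer (quadratic in tokens)."""
    return num_tokens * num_tokens


total_tokens = 8 * 14 * 14                                # assumed clip: 1568 tokens
visible_tokens = total_tokens - int(total_tokens * 0.9)   # tokens kept at 90% masking

full_cost = attention_pairs(total_tokens)
masked_cost = attention_pairs(visible_tokens)
reduction = full_cost / masked_cost
```

Keeping about 10% of the tokens cuts the number of attention pairs by roughly a factor of 100, which is why joint attention stays affordable here.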




Something-Something V2 and Kinetics-400

Looking at the results in the paper, a few things stand out. Here are two of them.


Mask sampling

  • Tube masking performed better than random masking.
  • However, it would be nice to see how performance changes with the masking ratio for random masking. With random masking at an extremely high ratio, too little information may be available from neighboring pixels along the time axis, so a somewhat lower masking ratio than the one used for tube masking may actually be beneficial.


Loss function

  • MSE loss (L2 loss) performed the best. The differences are small, but L2 loss is still the best performer on the reconstruction task, probably because it is:
    - Convex
    - Differentiable
    - Equivalent to maximum likelihood estimation under Gaussian noise


A Gaussian is a typical unimodal distribution, so optimizing under it yields the most "plausible" (average) reconstruction. In image reconstruction, this tends to blur the results. That L2 loss still works best here may be because, in video, it is more important to understand the behavior as a whole than to reconstruct the image pixel by pixel. This makes sense when you consider that even if a person looks a bit blurry, you can still get a general idea of what they are doing.
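
The link between L2 loss and Gaussian noise can be written out explicitly. As a sketch, assume that each target value equals the prediction plus independent Gaussian noise with a fixed variance. The symbols p (position index), Ω (set of masked positions), and σ² (noise variance) are introduced here only for illustration.

```latex
\begin{aligned}
I(p) &= \hat{I}(p) + \varepsilon_p, \qquad \varepsilon_p \sim \mathcal{N}(0, \sigma^2) \\
-\log p(I \mid \hat{I}) &= \frac{1}{2\sigma^2} \sum_{p \in \Omega} \bigl( I(p) - \hat{I}(p) \bigr)^2
  + \frac{|\Omega|}{2} \log\bigl( 2\pi\sigma^2 \bigr)
\end{aligned}
```

The second term does not depend on the prediction, so under this assumption minimizing the negative log-likelihood is the same as minimizing the sum of squared errors, that is, the L2 loss up to a constant factor.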

- But it doesn't make much difference.



