This is a review of the Video Swin Transformer paper, which extends the Swin Transformer, a model that achieved SOTA (State-of-the-Art) results in the image domain, to video.
- Paper Link: Video Swin Transformer
- Authors / Organization: Microsoft
- Github: https://github.com/SwinTransformer/Video-Swin-Transformer
Abstract
Recently, the trend in computer vision models has been shifting from CNNs to Transformers. Whereas existing video Transformer models rely on transformer layers that globally connect patches across the spatiotemporal dimensions, this paper borrows the structure of the Swin Transformer, an image-domain model, and applies self-attention locally in a video transformer. As a result, it achieves SOTA on video recognition tasks, including action recognition.
1. Introduction
The success of the Vision Transformer (ViT) on images has led to research applying Transformer structures to video recognition tasks, producing models such as ViViT, VTN, and TimeSformer. The model in this paper is also a transformer architecture, and its contributions are as follows.
- Performs self-attention locally instead of across the entire video, based on the assumption that neighboring time points (= frames) or locations (= pixels) in a video have similar values (spatiotemporal locality).
- Experimentally found that setting the backbone's learning rate to 0.1 times that of the head leads to higher performance.
As a result, the model outperformed the previous SOTA on video recognition tasks (Kinetics-400/600) while reducing computation and model size.
2. Video Swin Transformer
The model used in this paper is a straightforward adaptation of the Swin Transformer structure, with the only difference being that it is extended by one dimension, the time axis, for video application. Therefore, before looking at the Video Swin Transformer, we will first briefly review the Swin Transformer structure.
-----------------------
Swin Transformer (Shifted WINdow Transformer)
Background
Problems with the existing Vision Transformer (ViT)
- Uses a fixed patch (= token) size, so it does not reflect image characteristics: image resolution and object scale are not taken into account.
- Applies global self-attention, so computation grows quadratically with the number of patches.

Model architecture
To compensate for these problems of ViT, a hierarchical structure is applied so that the model is flexible with respect to resolution and object size.
- Processes a progressively larger area at once through patch merging when going from Stage 1 to Stage 4 (a rough sketch of patch merging follows below).
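As an illustration of that patch merging step, a minimal sketch (assuming the usual (B, H, W, C) tensor layout; not the official implementation) concatenates each 2x2 group of neighboring patches and projects them back down with a linear layer, halving the resolution and doubling the channels:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Minimal sketch: merge each 2x2 group of patches, halving H and W and doubling C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]     # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]     # bottom-left
        x2 = x[:, 0::2, 1::2, :]     # top-right
        x3 = x[:, 1::2, 1::2, :]     # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)
```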
Swin Transformer Block
- W-MSA (Window Multi-head Self-Attention)
- Applies a local window and performs self-attention between patches within each window.
→ Since the number of patches per window is fixed, the computation grows linearly with the image size (see the complexity comparison after this list).
- SW-MSA (Shifted Window Multi-head Self-Attention)
- Performs self-attention after cyclically shifting the window partition by M/2, so that neighboring pixels sitting on a window boundary can still attend to each other (M = window size, i.e., the number of patches along each side of a window); see the sketch below.
→ Regions that were not adjacent in the original image are masked so that they do not influence each other.
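To make the quadratic-vs-linear claim concrete, the complexity comparison from the original Swin Transformer paper (for a feature map of h x w patches, channel dimension C, and window size M) is:

$$\Omega(\text{MSA}) = 4hwC^{2} + 2(hw)^{2}C, \qquad \Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC$$

The first term covers the linear projections; the second is the attention itself, which is quadratic in the number of patches for global MSA but becomes linear once attention is restricted to fixed M x M windows.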

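The cyclic shift itself boils down to rolling the feature map by half the window size before partitioning it into windows again. A minimal sketch of the idea (illustrative only; the real block also builds the attention mask mentioned above):

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)  # (num_windows * B, M*M, C)

M = 7
x = torch.randn(1, 14, 14, 96)                 # (B, H, W, C): 14 x 14 patches

# W-MSA: attention inside each M x M window of the unshifted map.
windows = window_partition(x, M)

# SW-MSA: roll by M // 2 so patches on window borders fall into the same window,
# then (not shown) mask pairs that were not adjacent in the original image.
shift = M // 2
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
shifted_windows = window_partition(shifted, M)
```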
That was a brief explanation of Swin Transformer, now let's get back to the point and take a look at Video Swin Transformer.
-----------------
2.1 Overall Architecture
The difference from the original model is the addition of a time dimension (T) and the extension of W-MSA (Window Multi-head Self-Attention) and SW-MSA (Shifted Window Multi-head Self-Attention) to 3D.

[Figure: overall architecture of Swin Transformer vs. Video Swin Transformer]

2.1.1 PyTorch code
[Swin Transformer]
[Video Swin Transformer]
Comparing the code of the two models, they have the same structure except for the two differences below.
- Addition of a time axis: window_size (7, 7) → (2, 7, 7)
- WindowAttention → WindowAttention3D
Also, one of the main ideas, the cyclic shift, differs only by the addition of one more dimension, as sketched below.
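A minimal sketch of that difference (illustrative shapes only, using the (2, 7, 7) window from the comparison above; not the official code):

```python
import torch

# 2D (image): the window covers M x M patches and the cyclic shift rolls over (H, W).
window_size_2d = (7, 7)
x2d = torch.randn(1, 14, 14, 96)                      # (B, H, W, C)
shifts_2d = tuple(-(s // 2) for s in window_size_2d)
x2d_shifted = torch.roll(x2d, shifts=shifts_2d, dims=(1, 2))

# 3D (video): a temporal extent is added, the window covers P x M x M tokens,
# and the cyclic shift also rolls over the time axis (T, H, W).
window_size_3d = (2, 7, 7)
x3d = torch.randn(1, 8, 14, 14, 96)                   # (B, T, H, W, C)
shifts_3d = tuple(-(s // 2) for s in window_size_3d)
x3d_shifted = torch.roll(x3d, shifts=shifts_3d, dims=(1, 2, 3))
```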
2.2 3D Shifted Window based MSA Module

The overall logic is similar, as we have extended the Swin Transformer structure from 2D to 3D.
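In the paper's notation, two successive blocks alternate between regular and shifted 3D windows, and the attention inside each window adds a 3D relative position bias B (summarized here from the paper; d is the per-head query dimension):

$$
\begin{aligned}
\hat{z}^{l} &= \text{3DW-MSA}(\text{LN}(z^{l-1})) + z^{l-1}, &\quad z^{l} &= \text{FFN}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l},\\
\hat{z}^{l+1} &= \text{3DSW-MSA}(\text{LN}(z^{l})) + z^{l}, &\quad z^{l+1} &= \text{FFN}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}
\end{aligned}
$$

$$
\text{Attention}(Q, K, V) = \text{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V, \qquad Q, K, V \in \mathbb{R}^{PM^{2} \times d}
$$

Here B is drawn from a parameterized table $\hat{B} \in \mathbb{R}^{(2P-1)\times(2M-1)\times(2M-1)}$, which is exactly the tensor that Section 2.4 below has to inflate from the 2D pre-trained model.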

2.3 Architecture
The architecture comes in four versions whose model size and computational complexity are roughly 0.25x, 0.5x, 1x, and 2x those of the base model, respectively:
- Swin-T : C=96, layer numbers = {2, 2, 6, 2}
- Swin-S : C=96, layer numbers = {2, 2, 18, 2}
- Swin-B : C=128, layer numbers = {2, 2, 18, 2}
- Swin-L : C=192, layer numbers = {2, 2, 18, 2}
where C is the channel number of the hidden layers in Stage 1, and the window size P = 8, M = 7 is used by default.
Furthermore, the query dimension of each head is d = 32 and the expansion ratio of the MLP layer is α = 4.
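For quick reference, these variants and defaults can be collected into a small config sketch (an illustrative summary of the values above, not the official config files):

```python
# Illustrative summary of the four variants (embed_dim = C, depths = blocks per stage).
VARIANTS = {
    "Swin-T": {"embed_dim": 96,  "depths": (2, 2, 6, 2)},
    "Swin-S": {"embed_dim": 96,  "depths": (2, 2, 18, 2)},
    "Swin-B": {"embed_dim": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"embed_dim": 192, "depths": (2, 2, 18, 2)},
}

P, M = 8, 7      # default temporal / spatial window size
HEAD_DIM = 32    # per-head query dimension d
MLP_RATIO = 4    # MLP expansion ratio (alpha)
```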
2.4 Initialization from Pre-trained Model
As with the Swin Transformer, the model is initialized from weights pre-trained on a large image dataset. However, the following two blocks have shapes that differ from the image model, so they are handled separately.
Linear embedding layer
- In Stage 1, the time dimension is halved (T/2) and the input dimension of the linear embedding is doubled (48 → 96).
→ The weights of the pre-trained model are duplicated along the time dimension and the whole matrix is multiplied by 0.5, which keeps the mean and variance of the output unchanged.
Video Swin Transformer block's relative position biases
- (2M - 1, 2M - 1) → (2P - 1, 2M - 1, 2M - 1)
- Initialize the matrix of the pre-trained model by cloning it (2P - 1) times to make the relative position biases the same within each frame
2.4.1 PyTorch code
As we saw above, the model in this paper uses weights from a model pre-trained on images, so the weights need to be reshaped to fit video. In the official code, this is implemented in a function called inflate_weights, and you can see that the weights are duplicated there as described in the paper.
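Below is a rough re-implementation of those two cases for illustration (hypothetical helper names; the exact shapes, such as a 4x4 image patch inflated to a 2x4x4 video patch, are assumptions based on the descriptions above rather than the official inflate_weights):

```python
import torch

def inflate_patch_embed(weight_2d, temporal_patch):
    """(C_out, C_in, k, k) 2D conv weight -> (C_out, C_in, t, k, k) 3D conv weight.

    Duplicate along the new temporal axis and rescale so the output of the
    linear embedding keeps roughly the same mean and variance as in the image
    model (temporal_patch = 2 reproduces the "duplicate twice, multiply by 0.5" rule).
    """
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, temporal_patch, 1, 1)
    return weight_3d / temporal_patch

def inflate_relative_position_bias(bias_2d, P):
    """((2M-1)^2, n_heads) bias table -> ((2P-1) * (2M-1)^2, n_heads).

    Duplicate the pre-trained table (2P-1) times so the bias starts out
    identical for every temporal offset within a window.
    """
    return bias_2d.repeat(2 * P - 1, 1)

# Example with assumed shapes: 4x4 image patches -> 2x4x4 video patches, M = 7, P = 8.
w2d = torch.randn(96, 3, 4, 4)
w3d = inflate_patch_embed(w2d, temporal_patch=2)     # (96, 3, 2, 4, 4)
b2d = torch.randn((2 * 7 - 1) ** 2, 3)               # ((2M-1)^2, n_heads)
b3d = inflate_relative_position_bias(b2d, P=8)       # ((2P-1) * (2M-1)^2, n_heads)
```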
3. Experiments
3.1 Datasets
The experiments use the datasets listed below, and performance is evaluated with Top-1 and Top-5 accuracy.
Human action recognition
- Kinetics-400: 400 action categories, 240k (train) / 20k (val) videos
- Kinetics-600: 600 action categories, 370k (train) / 28.3k (val) videos
Temporal modeling
- Something-Something V2 (SSv2): 174 classes, 168.9k (train) / 24.7k (val) videos
3.2 Comparison to state-of-the-art
The experimental results show that the model achieves SOTA on both Kinetics (400/600) and Something-Something V2.

3.3 Ablation Study
Different designs for spatiotemporal attention
We also conducted an experiment to compare the performance of three different designs of spatiotemporal attention (Joint / Split / Factorized). Each design is described below.
- Joint (default): computes spatiotemporal attention jointly in each 3D window-based MSA layer.
- Split: adds two temporal transformer layers on top of a spatial-only Swin Transformer (an approach shown to be useful in ViViT and VTN).
- Factorized: adds a temporal-only MSA layer after each spatial-only MSA layer in the Swin Transformer (the method used in TimeSformer).
The results show that the joint version performed the best.

Temporal dimension of 3D tokens & Temporal window size
- The following is a performance comparison experiment by adjusting the temporal dimension and window size.

- In the temporal-dimension experiment, a larger temporal dimension yields higher Top-1 accuracy, but also increases the computational cost and slows down inference.
- In the window-size comparison, performance likewise improves as the temporal window size grows (+0.3%), but the gain comes at a significant computational cost (+17%).
3D shifted windows
We also ran an experiment to show the benefits of applying 3D shifted windows, which also resulted in a performance improvement.

Ratio of backbone/head learning rate
This experiment appears to document a finding made while tuning hyperparameters: the table below shows that performance is better when the backbone's learning rate is 0.1 times that of the head.

4. Conclusion
In this paper, the authors propose extending the existing Swin Transformer to 3D so that pre-trained image models can be reused while reducing the amount of computation. As a result, they achieve SOTA on three widely used benchmarks (Kinetics-400, Kinetics-600, and Something-Something v2), showing the potential of extending/adapting image models to video.
This concludes the review of the Video Swin Transformer paper.