Paper review | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Zero-shot Video-Text Understanding

Suho Cho



Contrastive pre-training applies the idea of CLIP to video. In contrastive learning, every video other than the correct pair is treated as a negative, even semantically similar ones, and the resulting video-language model handles various video-language tasks such as retrieval as well as VideoQA. The video encoder was pre-trained on the HowTo100M dataset, and by borrowing CLIP's approach of learning text-video associations rather than training with direct task objectives, the model achieves notable results in zero-shot video-text understanding.




As contrastive pre-training centered on NLP models became widely popular, zero-shot learning emerged to reduce the burden of fine-tuning. The GPT series achieved significant zero-shot results through in-context learning, and CLIP partially realized zero-shot transfer between image and text through contrastive learning. In this paper, the authors borrow CLIP's method for video-text zero-shot learning. There are two key techniques.

  • Increasing the relevance of video and text
    Existing video-text models are trained by cutting videos at fixed time intervals, so the model ends up seeing many pairs whose video and text do not actually match in meaning. To improve this, the paper devises overlapped video-text clips.

  • Tightening the negative pairs in the contrastive loss
    The authors found that retrieval-augmented pre-training was more effective at finding hard negative pairs.


**What is Retrieval-Augmented pre-training?

Suppose you have an open-domain QA bot: a plain LM alone cannot handle every domain effectively. It is therefore more effective, in both speed and performance, to supply the LM with external knowledge. In other words, an information-selection step via retrieval is added on top of the original LM. Several of the giant tech companies are even doing reinforcement learning on top of this approach.

Retrieval-augmented training
Example of retrieval-augmented training for QA (from the paper)
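To make the idea concrete, here is a minimal retrieve-then-read sketch in Python; the bag-of-words `embed` stand-in and all names are hypothetical, not from the paper:

```python
# Minimal retrieve-then-read sketch for open-domain QA. The bag-of-words
# `embed` stand-in and all names are hypothetical, not from the paper.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding: hash words into a fixed-size bag-of-words vector."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    scores = [float(q @ embed(p)) for p in corpus]
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

corpus = [
    "VideoCLIP is pre-trained on HowTo100M instructional videos.",
    "S3D is a CNN video encoder producing 512-dim features.",
    "Paris is the capital of France.",
]
# The LM would then answer conditioned on these retrieved passages
# instead of relying on its parameters alone.
print(retrieve("Which dataset is VideoCLIP pre-trained on?", corpus))
```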



VideoCLIP Contrastive pre-training


Video and Text Encoding


Both the video encoder and the text encoder use the Transformer architecture. For video, however, the clip is first fed through a frozen pre-trained Video Encoder, its output is passed through a trainable MLP to match the Transformer's input shape, and the result then goes through a trainable Transformer. The video side's Transformer takes only the first 6 layers of BERT's structure, while the text encoder uses BERT as-is. It is also somewhat unusual that, for the contrastive loss between the two Transformers, the model does not use BERT's [CLS] token but instead average-pools over the entire sequence.

VideoCLIP model overview
VideoCLIP model block diagram


**VideoEncoder is a CNN-based video encoder called S3D, pre-trained on HowTo100M (output dimension 512, video at 30 fps).

**A trainable MLP projects from 512 to 768 to match BERT's hidden size.

**At 30 fps, the encoder generates one video token per second.

**The video encoder's max sequence length is 34 tokens including CLS and SEP (max 32 seconds of video); the text encoder's is 63 including CLS and SEP. This length was set empirically: the average number of text tokens per second in HowTo100M is about 2.4.
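A rough sketch of this encoding path, using the shapes from the notes above; the module names and the stand-in Transformer are my assumptions, not the authors' code:

```python
# Sketch of the encoding path above (shapes from the notes). The module names
# and the stand-in Transformer are assumptions, not the authors' code.
import torch
import torch.nn as nn

S3D_DIM, BERT_DIM = 512, 768                 # frozen S3D output -> BERT hidden size
MAX_VIDEO_TOKENS = 32                        # 32 s of video, one token per second

video_proj = nn.Sequential(                  # trainable MLP: 512 -> 768
    nn.Linear(S3D_DIM, BERT_DIM),
    nn.GELU(),
)

# Stand-in for the first 6 BERT layers that form the video Transformer.
video_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=BERT_DIM, nhead=12, batch_first=True),
    num_layers=6,
)

# Pretend the frozen S3D already turned a clip into one 512-dim token per second.
s3d_tokens = torch.randn(1, MAX_VIDEO_TOKENS, S3D_DIM)
video_states = video_transformer(video_proj(s3d_tokens))   # (1, 32, 768)

# Average-pool over the sequence (no [CLS]) to get the contrastive embedding.
z_video = video_states.mean(dim=1)                         # (1, 768)
```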




They use InfoNCE, a standard NCE objective applied symmetrically: the text-to-video and video-to-text directions are swapped so the two modalities learn from each other.

There are already many good write-ups on NCE, so I won't go into it here.

**Link to NCE papers



As in the original CLIP, the text-to-video and video-to-text terms are added together to form the loss.

The loss code for the original CLIP is shown below:

Loss code for the original CLIP
Loss code for the original CLIP
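As a reference, here is a PyTorch-style sketch of that symmetric contrastive loss; the function name and temperature default are illustrative, not lifted from CLIP's released code:

```python
# PyTorch-style sketch of the symmetric contrastive (InfoNCE) loss.
# Names and the temperature default are illustrative, not CLIP's released code.
import torch
import torch.nn.functional as F

def contrastive_loss(z_video: torch.Tensor, z_text: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """z_video, z_text: (batch, dim) pooled embeddings of paired clips."""
    z_video = F.normalize(z_video, dim=-1)
    z_text = F.normalize(z_text, dim=-1)
    logits = z_video @ z_text.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(len(logits), device=logits.device)  # diagonal = positives
    loss_v2t = F.cross_entropy(logits, labels)           # video-to-text direction
    loss_t2v = F.cross_entropy(logits.t(), labels)       # text-to-video direction
    return (loss_v2t + loss_t2v) / 2                     # combine both directions
```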


Overlapped Video-Text Clips


This is one of the two video-text understanding techniques mentioned in the introduction. Training pairs are built by sampling the subtitles and the video separately and then combining them, in the following order:

  1. First sample a subtitle segment of the video.
  2. Then randomly sample up to 32 seconds of video, centered on the subtitle's start and end times.


The idea is to keep the subtitles from becoming irrelevant to the video. Here is how the authors explain it (a sampling sketch follows the example):


Suppose a chef appears and says, "I'm going to show you how to make fried rice." Is this video-subtitle pair aligned correctly? For learning representations of video and text, the next shot, where the chef actually makes the fried rice, would be the better match (it corresponds more directly to the words "fried rice" and "make").
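A minimal sketch of this overlapped sampling; the uniform distributions below are my assumptions, since the text here only fixes the 32-second cap:

```python
# Minimal sketch of overlapped video-text clip sampling. The uniform
# distributions below are assumptions; only the 32 s cap is from the paper.
import random

MAX_VIDEO_SECONDS = 32.0

def sample_overlapped_clip(sub_start: float, sub_end: float, video_duration: float):
    """Sample a video clip of random length whose window overlaps the subtitle."""
    text_len = sub_end - sub_start
    # 1. Pick a clip length up to 32 s, at least long enough to cover the subtitle.
    clip_len = random.uniform(min(text_len, MAX_VIDEO_SECONDS), MAX_VIDEO_SECONDS)
    # 2. Slide the window randomly so it still contains the subtitle's center.
    center = (sub_start + sub_end) / 2
    start = random.uniform(center - clip_len, center)
    # 3. Clamp to the video boundaries.
    start = min(max(start, 0.0), max(0.0, video_duration - clip_len))
    return start, start + clip_len
```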


Retrieval-Augmented Training


FAISS is used for the similarity comparison:

  1. Compute an embedding for each video (the average of the embeddings of all video-text clips in that video).
  2. Build a dense index over all videos.
  3. Retrieve the 2k videos most similar to the video being trained on.
  4. Randomly sample k of those 2k videos and use them as a training batch (where k is the batch size), as in the sketch below.
  5. Within the batch, everything other than the positive pair is treated as a negative sample.
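A sketch of these steps using the actual FAISS API; the flat inner-product index and the per-seed batching are simplifying assumptions about the pipeline:

```python
# Sketch of the retrieval-augmented batch construction using the FAISS API.
# The flat index and per-seed batching are simplifications (assumptions).
import faiss
import numpy as np

def build_retrieval_batches(video_embs: np.ndarray, k: int) -> list[np.ndarray]:
    """video_embs: (num_videos, dim), one averaged embedding per video (step 1)."""
    video_embs = np.ascontiguousarray(video_embs, dtype=np.float32)
    faiss.normalize_L2(video_embs)            # cosine similarity via normalized IP
    index = faiss.IndexFlatIP(video_embs.shape[1])
    index.add(video_embs)                     # step 2: dense index over all videos

    _, neighbors = index.search(video_embs, 2 * k)  # step 3: 2k nearest per video
    batches = []
    for nbrs in neighbors:                    # step 4: keep a random k of the 2k
        batches.append(np.random.choice(nbrs, size=k, replace=False))
    return batches                            # step 5: in-batch non-pairs = negatives
```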



Text-to-Video Retrieval


Zero-shot performance was tested on the Youcook2, MSR-VTT, and DiDeMo datasets, with random retrieval as a baseline. The results are generally not far from the SOTA reported on VALUE.

Zero-shot tests on the Youcook2, MSR-VTT, and DiDeMo datasets
Zero-shot tests on Youcook2, MSR-VTT, and DiDeMo datasets
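Mechanically, zero-shot retrieval just ranks candidate video embeddings by similarity to the query text embedding; a minimal sketch, with names of my own choosing rather than the evaluation code:

```python
# Sketch of zero-shot text-to-video retrieval: rank candidate videos by
# cosine similarity to the query text embedding. Names are illustrative.
import torch
import torch.nn.functional as F

def retrieve_videos(z_text: torch.Tensor, z_videos: torch.Tensor, top_k: int = 5):
    """z_text: (dim,) query embedding; z_videos: (N, dim) candidate embeddings."""
    sims = F.normalize(z_videos, dim=-1) @ F.normalize(z_text, dim=0)
    return sims.topk(top_k).indices          # indices of the best-matching videos
```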


Action Segmentation


Assuming the video tokens have learned associations with text, each video token is aligned with a pre-defined label, much like NER: the token is assigned the label whose text embedding it is most similar to.


Aligning VideoCLIP video tokens with pre-defined labels
Aligning video tokens with predefined labels
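A minimal sketch of this token-label alignment; the cosine-similarity argmax is my reading of the figure, and all names are illustrative:

```python
# Sketch of zero-shot action segmentation: label each video token with the
# most similar label-text embedding. All names are illustrative.
import torch
import torch.nn.functional as F

def segment_actions(video_states: torch.Tensor,
                    label_embs: torch.Tensor, labels: list[str]) -> list[str]:
    """video_states: (T, dim) per-second video token states;
    label_embs: (num_labels, dim) pooled text embeddings of the label names."""
    sims = F.normalize(video_states, dim=-1) @ F.normalize(label_embs, dim=-1).t()
    best = sims.argmax(dim=-1)               # (T,) index of best label per token
    return [labels[i] for i in best]
```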


The method above was evaluated on the COIN dataset, a video classification dataset built by taking YouTube videos, cutting them into segments, and labeling each segment.

COIN Dataset example
COIN Dataset Example

Action segmentation on COIN
Action segmentation on COIN


I was personally very surprised that these results came out even though the average-pooled features were trained with only an NCE loss. I think there is a lot of room for thought here.


Training time


Training reportedly took one day on 8× V100s when training with the NCE loss, after the required embeddings were extracted in advance with the pre-trained VideoEncoder. The extracted HowTo100M features can be downloaded here.



Conclusion and Discussion of Contrastive pre-training


This is the first paper to do CLIP-style contrastive learning with a frozen VideoEncoder. It is not hard to understand and is quite clean, but I still don't intuitively understand the use of average pooling instead of the CLS token. Averaging the discrete values of each token makes sense from a video perspective, but less so from a text perspective. If something similar appears elsewhere in the future, I'll think about it more carefully.
