Contrastive pre-training applies the idea of CLIP to video. In contrastive learning, every video other than the correct match is treated as a negative, even if it is similar, and the resulting video-language model supports a range of video-language tasks such as retrieval and VideoQA. The video encoder was pre-trained on the HowTo100M dataset, and by borrowing CLIP's approach of learning text-video associations rather than training on direct task objectives, the model achieved solid zero-shot video-text understanding.
As contrastive pre-training of NLP models has become widely popular, zero-shot learning emerged to reduce the burden of fine-tuning. The GPT series achieved significant zero-shot results through in-context learning, and CLIP partially realized zero-shot transfer between image and text through contrastive learning. This paper borrows CLIP's approach for video-text zero-shot learning, using two key techniques.
Suppose you have an open-domain QA bot: a plain LM cannot handle every domain effectively on its own. Providing external knowledge to assist the LM is therefore more effective in both speed and performance. In other words, information selection via retrieval is added on top of the original LM. Several big-tech companies are using this approach.
Both the video encoder and the text encoder use the Transformer structure. For video, however, the raw video is first fed through a pre-trained video encoder, and its output passes through a trainable MLP that projects it to the Transformer's input shape before entering the (also trained) Transformer. The video side's Transformer uses only the first 6 layers of BERT, while the text encoder uses BERT as-is. It is also somewhat unusual in that, instead of using BERT's CLS token for the contrastive loss between the two Transformers, the model averages over the entire sequence.
- Video encoder: S3D, a CNN-based video encoder pre-trained on HowTo100M (output dimension 512, 30 fps).
- A trainable MLP projects from 512 to 768 to match BERT's hidden size.
- Video is 30 fps, generating one video token per second.
- Max sequence length is 34 for the video encoder including CLS and SEP (up to 32 seconds of video), and 63 for the text encoder including CLS and SEP (set based on empirical evidence: the average number of text tokens in VALUE + HowTo100M is about 2.4 per second).
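The video-side pipeline above can be sketched as follows (shapes only; `encode_video` and the `transformer` argument are illustrative stand-ins for the real S3D + 6-layer BERT stack):

```python
import numpy as np

S3D_DIM, BERT_DIM = 512, 768  # S3D output size -> BERT hidden size

def encode_video(s3d_features, mlp_w, mlp_b, transformer):
    """s3d_features: (num_seconds, 512), one S3D token per second.
    Project to BERT's hidden size with the trainable MLP, run the
    6-layer transformer, then average-pool over the whole sequence
    (instead of taking a CLS token)."""
    x = s3d_features @ mlp_w + mlp_b      # (num_seconds, 768)
    h = transformer(x)                    # (num_seconds, 768)
    return h.mean(axis=0)                 # (768,) pooled video embedding
```

Average pooling at the end is the detail noted above: the contrastive loss compares this pooled vector against the (also average-pooled) text embedding.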
We use InfoNCE, a typical NCE variant, with symmetric text-to-video and video-to-text terms that are learned jointly.
There is already a lot of good writing on NCE, so I won't go into it here.

- Link to NCE papers

Following the original CLIP, the loss sums the text-to-video and video-to-text directions.
The loss for the original CLIP looks like the following:
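In a minimal numpy sketch (function and variable names are illustrative, following the well-known pseudocode from the CLIP paper), the symmetric loss is:

```python
import numpy as np

def clip_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (video_i, text_i) pairs on the
    diagonal are positives, all other pairs in the batch are negatives."""
    # L2-normalize both sets of embeddings
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # logits[i, j] = cosine similarity of video_i and text_j, scaled
    logits = v @ t.T / temperature
    n = logits.shape[0]
    idx = np.arange(n)

    def xent(l):
        # row-wise cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    loss_v2t = xent(logits)    # video -> text direction
    loss_t2v = xent(logits.T)  # text -> video direction
    return (loss_v2t + loss_t2v) / 2
```

With perfectly matched, well-separated pairs the loss approaches zero; in-batch negatives push mismatched similarities down.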
This is the first of the two video-text understanding techniques mentioned in the introduction: subtitles and video clips are sampled separately and then combined to build the training data, in the following order.
The idea is to prevent the subtitles from becoming less relevant to the video. The authors explain it as follows:
Suppose a chef comes on and says, "I'm going to show you how to make fried rice." Is this video and subtitle aligned correctly? For learning representations of video and text, the next shot, where the chef actually makes fried rice, would be the better match (it corresponds more closely to the words "fried rice" and "make").
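One way this overlapped sampling could be sketched is below. This is a loose illustration, not the paper's exact procedure: `sample_overlapping_pair`, the duration range, and the uniform distributions are all assumptions.

```python
import random

def sample_overlapping_pair(text_spans, max_video_len=32):
    """Sample a (video_span, text_span) pair that overlaps in time
    but need not be exactly aligned (sketch of separate sampling).
    text_spans: list of subtitle (start, end) times in seconds."""
    # 1) pick a subtitle span
    t_start, t_end = random.choice(text_spans)
    # 2) pick a center timestamp somewhere inside that subtitle
    center = random.uniform(t_start, t_end)
    # 3) grow a video clip of random duration around the center,
    #    so the clip overlaps the subtitle without copying its bounds
    duration = random.uniform(3.0, max_video_len)
    v_start = max(0.0, center - duration / 2)
    v_end = v_start + duration
    return (v_start, v_end), (t_start, t_end)
```

Because the video clip is grown around a point inside the subtitle span, every sampled pair overlaps, which preserves relevance while decoupling the exact boundaries.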
Using FAISS for similarity comparison
We tested zero-shot retrieval on the YouCook2, MSR-VTT, and DiDeMo datasets, with random retrieval as a baseline. The results show that performance is generally not far from the SOTA on VALUE.
Assuming the video inputs have learned associations with text, we align video tokens with pre-defined labels, NER-style: each video token is matched to the label text it is most similar to.
We evaluated the above method on the COIN dataset, a video classification dataset built from YouTube videos that are cut into segments, with a label per segment.
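The label-alignment idea can be sketched like this (a zero-shot sketch; `zero_shot_segment` and its inputs are illustrative, with embeddings assumed to come from the trained video and text encoders):

```python
import numpy as np

def zero_shot_segment(video_tokens, label_embs):
    """Assign each per-second video token the label whose text
    embedding is most similar under cosine similarity.
    video_tokens: (num_tokens, d), label_embs: (num_labels, d)."""
    v = video_tokens / np.linalg.norm(video_tokens, axis=1, keepdims=True)
    l = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = v @ l.T              # (num_tokens, num_labels)
    return sims.argmax(axis=1)  # one label id per video token
```

No task-specific training is involved: classification falls out of the contrastively learned video-text similarity alone.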
I was personally very surprised that this result came out even though the average-pooled representation was trained only with the NCE loss. I think there is a lot of room for thought here.
It is said that training with the NCE loss took one day on 8× V100 GPUs, after pre-computing the required embeddings with the pre-trained video encoder. The extracted HowTo100M features can be downloaded here.
This is the first paper to do CLIP-like contrastive learning with a frozen video encoder. It is not difficult to understand and is cleanly presented, but I still don't intuitively understand the use of average pooling instead of CLS. Averaging the per-token outputs makes sense from a video perspective, but not from a text perspective. If something similar appears elsewhere in the future, I think I will consider it more carefully.