What is Vision Transformer? - Definition, Implementation, Application

Overcoming CNN's limitations with NLP's method

Sangsun Moon

As computer vision advances rapidly and reaches more areas of our daily lives, Vision Transformer has become one of the key technologies drawing the interest of researchers and developers. In this article, we will explore what Vision Transformer is, why it matters, how it works, and where it is used.

What is Vision Transformer (ViT)?


ViT: Vision Transformer
ViT: Vision Transformer. Transformers for image recognition at… | by Shivani Junawane | Machine Intelligence and Deep Learning | Medium

Vision Transformer (ViT) is a deep learning model that performs well on image recognition and processing tasks, applying the transformer structure that has been successful in the field of natural language processing (NLP) to computer vision. By using the Transformer structure instead of traditional Convolutional Neural Networks (CNNs), global image information can be properly captured and processed.

The Vision Transformer captures global information and interacting elements in an image, and has shown high performance in a variety of computer vision tasks. It is a model that has been widely used in many fields recently because it can be adapted to a wide variety of problems and datasets.


Architectures built on the Transformer structure have achieved a number of state-of-the-art (SOTA) results, with the ViT paper as the starting point. ViT was researched with the goal of learning from more data at a lower cost, and since its release in 2021, it has been hailed as a game-changer in image classification. For details, please see the Vision Transformer paper at this link.

Comparing CNNs and ViTs from a computer vision perspective

Overview and definitions

Convolutional Neural Networks (CNNs) are the most popular deep learning models in the field of computer vision, used for image recognition, object detection, medical image processing, and many other vision tasks. CNNs recognize spatial hierarchies, and they consist of components such as convolutional layers and pooling layers that capture local patterns.

In contrast, Vision Transformer (ViT) is a model that applies the Transformer to the field of computer vision. The Transformer mainly handles sequence data; ViT works by applying it to an image, taking into account information across the whole image.

Differences and features

ViTT: Vision Transformer Tracker

ViT splits the image into small patches, vectorizes the patches, and uses Transformer's encoder to extract contextual information within the image. Because of these changes, ViTs and CNNs differ in their basic structure and behavior.

CNNs have a structure suitable for local feature extraction, exploring relationships between neighboring pixels while preserving the spatial information of the image. ViT, on the other hand, considers the global information of an image on a patch-by-patch basis. To do this, it learns contextual information through Transformer's Attention mechanism, which is applied to tasks like image classification.
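To make the idea of patch-by-patch global context more concrete, here is a minimal sketch of self-attention in plain Python. It is a simplification under stated assumptions: a single attention head, no learned weights (a real ViT first applies learned query, key, and value projections), and toy patch vectors made up for illustration. In practice this would be written with a tensor library such as PyTorch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(patches):
    """Single-head self-attention without learned weights.

    Simplification: queries, keys, and values are the patch vectors
    themselves; a real ViT applies learned linear projections first.
    """
    dim = len(patches[0])
    outputs, weights = [], []
    for query in patches:
        # Score this patch against every patch in the image (global context).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in patches
        ]
        attn = softmax(scores)
        weights.append(attn)
        # The output is the attention-weighted average of all patch vectors.
        outputs.append(
            [sum(a * v[i] for a, v in zip(attn, patches)) for i in range(dim)]
        )
    return outputs, weights


# Four toy patch vectors (hypothetical values); patches 0 and 3 are identical.
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(patches)
```

Each row of `weights` sums to 1 and spans all four patches, so patch 0 can draw on patch 3 at the far end of the sequence just as easily as on itself. This is the global behavior that a small convolution kernel does not have.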

CNN models have a compact size and efficient memory utilization structure, making them ideal for resource-constrained environments. Lightweight and fast image processing systems can be built on top of them, and they have shown good accuracy in a variety of computer vision applications. Vision Transformers, on the other hand, have the advantage of being a powerful approach to capturing global dependencies and contextual understanding in images, allowing performance improvements to be seen on specific tasks.

However, Vision Transformers have the disadvantage of large model sizes and high memory requirements. They can achieve remarkable accuracy when dealing with large datasets, but they are somewhat less practical in resource-constrained scenarios.

To summarize, CNNs are models that specialize in spatial hierarchy and local feature extraction, while ViTs leverage a transformer structure that transforms images into patches and processes them with global information and context. Ultimately, if you have to choose between the two models, your decision should be based on the specific requirements of the task at hand, taking into account factors such as your available resources, the size of your dataset, and the balance between model complexity, accuracy, and performance.

Vision Transformer Principle and Structure

How it works

The structure of the Vision Transformer includes the following elements:

 PyTorch Paper Replicating
08. PyTorch Paper Replicating - Zero to Mastery Learn PyTorch for Deep Learning

  1. Image patching: Vision Transformer divides the input image into small square patches of fixed size. Typically, an image is preprocessed to a size of 224 x 224 and then divided into patches of size 16 x 16. These patches are then fed into the neural network.
  2. Patch vectorization and positional information: Each patch is flattened into a one-dimensional vector, and information about the position of each patch is added to it. This lets the Transformer know where each part of the image is located.
  3. Transformer encoder: The vectorized, position-encoded patches are used as input to the Transformer encoder. The encoder uses self-attention and feed-forward neural networks to extract complex patterns and contextual information from the image.
  4. Classification token and MLP head: During patch vectorization, a special classification token is added to the input. As it passes through the encoder, this token is transformed into a vector that gathers information from every patch, and the encoder's output for this token is used to perform the classification. An MLP head transforms that vector into the final classification result, with a softmax activation function applied.
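The input preparation described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, the learned linear projection of each flattened patch is omitted, and the classification token and position vectors are fixed toy values rather than learned parameters.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    per_side = image_size // patch_size
    return per_side * per_side


def patchify(image, patch_size):
    """Split an H x W image (nested lists) into flattened patches, row by row."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


def build_tokens(patches, class_token, position_embeddings):
    """Prepend the classification token, then add one position vector per token."""
    sequence = [class_token] + patches
    return [
        [x + p for x, p in zip(token, pos)]
        for token, pos in zip(sequence, position_embeddings)
    ]


# Sizes quoted in the text: a 224 x 224 RGB image with 16 x 16 patches.
n_patches = num_patches(224, 16)  # 14 * 14 = 196 patches
patch_dim = 16 * 16 * 3           # 768 values per flattened RGB patch

# Toy 4 x 4 single-channel image with 2 x 2 patches (values are made up).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
class_token = [0.0] * 4                                        # learned in a real ViT
positions = [[0.1 * i] * 4 for i in range(len(patches) + 1)]   # learned in a real ViT
tokens = build_tokens(patches, class_token, positions)
```

The encoder then receives `len(patches) + 1` tokens: one per patch plus the classification token. With the sizes from the text, that is 197 tokens.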

However, when trained on a medium-sized dataset without strong regularization, ViT was found to perform worse than traditional ResNets. This was attributed to the Transformer structure lacking the inductive biases built into the aforementioned CNNs, so it generalizes poorly without a large amount of data.

However, these structural limitations can be overcome by training on very large amounts of data (around 14 million to 300 million images). In other words, Vision Transformer has a simpler overall structure than a CNN, but its performance advantage shows up only on large amounts of data.

How to implement

PyTorch and TensorFlow are the deep learning frameworks typically mentioned for implementation. Each framework has its own features, pros, and cons.



PyTorch

  • Dynamic graphs: PyTorch lets you define and run models using dynamic computational graphs. This makes it easy to change and adjust your model, and simplifies tasks such as debugging and experimentation.
  • Intuitive syntax: PyTorch provides a Pythonic syntax that makes defining and training models simple. This gives users an intuitive experience and makes code more readable and maintainable.
  • Tools and community: PyTorch has a variety of useful tools and libraries and an active community. These can help with model development, data preprocessing, visualization, and more.

PyTorch Vision Transformer
08. PyTorch Paper Replicating - Zero to Mastery Learn PyTorch for Deep Learning

An example of an actual code implementation of the Vision Transformer paper can be found at this link. It takes the diagrams, math equations, and descriptive text in the paper and implements them in code to classify images of food.


TensorFlow

  • Static graphs: TensorFlow has traditionally defined and executed models as static computational graphs (TensorFlow 2.x runs eagerly by default, but can still compile code into graphs with tf.function). A graph is difficult to change once it is defined, but it helps with portability and performance optimization.
  • Distributed training and deployment: TensorFlow provides features to support distributed training. It enables efficient training of models across multiple GPUs or distributed computing clusters, and facilitates deployment and model serving.
  • TensorFlow Hub: A platform called TensorFlow Hub makes it easy to share and reuse pre-trained models and model components. This improves the efficiency of model development and helps you explore different model architectures.

Each framework can be chosen based on developer preference and project needs. PyTorch offers a flexible, intuitive design and a strong community, while TensorFlow has features that facilitate distributed training and serving. Whichever framework you choose to implement ViT, you'll want to learn its syntax and features to build and train your model.

Vision Transformer Use Cases

As datasets continue to grow in size and learning methods diversify, it becomes important to develop vision architectures that can train efficiently on multiple datasets. In this context, ViT has been used in a variety of computer vision tasks thanks to the wide applicability and scalability of the architecture. Below you'll find a variety of applications and use cases where ViT has been applied.

Image Classification

New ViT and ALIGN Models From Kakao Brain

Until now, CNNs have been considered the best models for image classification tasks. Since its development, ViT has been touted as an alternative to CNNs. Its performance falls short on small and medium-sized datasets, but it can outperform CNNs on very large datasets.

Kakao Brain and Hugging Face's "ViT Align" project started with the goal of studying the impact of Vision Transformer (ViT) models on image classification techniques, and comparing ViT and CNN to explore the advantages and disadvantages of these two models.

In this project, we experimented with the Imagenette and Imagewoof datasets, which are drawn from the classes of ImageNet-1k, unlike previous studies that had difficulty balancing the classes in the dataset. This allowed us to validate that the ViT model works well on large datasets, and to compare its performance with the CNN model as a baseline.

In our experiments, the ViT model outperformed the CNN model by achieving 87.7% top-1 accuracy on the Imagenette dataset and 78.2% top-1 accuracy on the Imagewoof dataset. In particular, ViT outperformed CNN on the Imagenette dataset with class imbalance, indicating that the ViT model has good generalization ability on large datasets.

Through this project, we gained a better understanding of the image classification technology of the Vision Transformer model, and the performance comparison between the ViT and CNN models highlighted the importance of choosing the right model by referring to the advantages and disadvantages of both models. We also learned about the influence of the ViT model in the field of image classification.

Image Captioning

Vision Transformer Image Captioning
Vision Transformer: What It Is & How It Works [2023 Guide]

Image captioning is a deep learning technique that takes an image as input and generates a natural language caption (title, description, etc.) for that image. It combines the fields of image classification and natural language processing, and works by training a predictive model that fuses computer vision and language modeling techniques.

Unlike traditional models, the Vision Transformer model uses Transformer blocks to process the input image. It divides the image into patches and projects each patch into a new embedding dimension, so that the patches represent the image much like a sequence of words. In this way, the features of the input image can be extracted, the overall characteristics of the image can be defined, and captions can be generated using models such as RNN, LSTM, and GRU based on the input natural language sequence.

Image Segmentation

Vision Transformer Image segmentation

Transformer-based deep learning models are also used to perform image segmentation tasks. Image segmentation is the process of analyzing an image pixel by pixel and assigning a label to each pixel, so that individual objects and their boundaries are identified. Dense Prediction Transformers (DPT) can effectively solve this image segmentation problem using the Transformer architecture.

The structure of a DPT is basically very similar to a Vision Transformer (ViT). The main difference is that DPT generates predictions for individual pixels in order to segment the components of an image. These predictions are used to separate objects in the image while preserving the spatial information of the original image. Instead of dealing with each image patch independently, the encoder-decoder structure produces predictions that take into account the context of the entire image, which can lead to high-performance image segmentation results.
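The step from patch tokens back to pixels can be illustrated with a deliberately simplified sketch. It is not DPT's actual decoder, which reassembles tokens into feature maps and refines them with learned layers. Here the assumption is that each patch has already been given a single label, and the grid is simply upsampled by repetition. The function names and labels are hypothetical.

```python
def tokens_to_grid(patch_labels, grid_width):
    """Reshape a flat sequence of per-patch labels into a 2D grid (row-major)."""
    return [
        patch_labels[i:i + grid_width]
        for i in range(0, len(patch_labels), grid_width)
    ]


def upsample_nearest(grid, patch_size):
    """Repeat each grid cell patch_size times in both directions."""
    pixels = []
    for row in grid:
        wide_row = [label for label in row for _ in range(patch_size)]
        pixels.extend([list(wide_row) for _ in range(patch_size)])
    return pixels


# Hypothetical labels for a 2 x 2 grid of patches: 0 = background, 1 = object.
patch_labels = [0, 1, 0, 0]
grid = tokens_to_grid(patch_labels, grid_width=2)
mask = upsample_nearest(grid, patch_size=2)
```

The result is a 4 x 4 mask whose top-right 2 x 2 block is labeled as the object. The spatial layout survives because the token order records where each patch came from.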

An MAE pre-trained Vision Transformer model was also used as the image encoder of the Segment Anything Model (SAM) recently released by Meta.


Anomaly Detection

Vision Transformer Anomaly Detection

Vision Transformer-based anomaly detection and localization networks combine a reconstruction-based approach with patch embedding. Transformer networks can help preserve the spatial information of embedded patches.
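As a rough illustration of the reconstruction-based idea, the sketch below scores each patch by how badly it was reconstructed and flags the patches above a threshold. The patch values, the reconstructions, and the threshold are all made up for the example. In a real system the reconstructions would come from a trained network and the threshold would be tuned on validation data.

```python
def patch_errors(original, reconstructed):
    """Mean squared reconstruction error for each patch."""
    return [
        sum((o - r) ** 2 for o, r in zip(orig, recon)) / len(orig)
        for orig, recon in zip(original, reconstructed)
    ]


def locate_anomalies(errors, threshold):
    """Indices of patches whose reconstruction error exceeds the threshold."""
    return [i for i, e in enumerate(errors) if e > threshold]


# Hypothetical flattened patches and their reconstructions.
original = [
    [0.2, 0.2, 0.2, 0.2],
    [0.9, 0.1, 0.8, 0.0],  # unusual texture the model cannot reproduce
    [0.5, 0.5, 0.5, 0.5],
]
reconstructed = [
    [0.2, 0.2, 0.2, 0.2],
    [0.3, 0.3, 0.3, 0.3],
    [0.5, 0.4, 0.5, 0.5],
]

errors = patch_errors(original, reconstructed)
anomalous = locate_anomalies(errors, threshold=0.05)  # assumed threshold
image_score = max(errors)  # one possible image-level anomaly score
```

Because the errors are kept per patch and the patch order encodes position, the flagged indices point directly to where in the image the anomaly lies.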

Autonomous driving

Transformer Model with Cross-Attention Module

At the 2021 Tesla AI Day event, Tesla revealed more details about the inner workings of the neural networks that power its self-driving technology, Tesla Full Self-Driving (FSD). One of the most interesting of these is the "Image to BEV Transform + Multi-Camera Fusion" technology, which maps camera images into a bird's-eye view (BEV). The technology is organized around a Transformer model with a cross-attention module.

Conclusion: The importance of the Vision Transformer model in computer vision will only grow.

Why computer vision should adopt Vision Transformer

Transformers for Image Recognition at Scale (Vision Transformers)
An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (Vision Transformers)

CNNs have been useful on large amounts of image data by extracting features from images of different sizes and combinations. However, CNNs have historically suffered from the difficulty of capturing context and long-range dependencies. ViT is an effective alternative that addresses these issues while providing accuracy and scalability. Here are some specific reasons to adopt Vision Transformer in computer vision.

  1. Scale-effective performance: Vision Transformer delivers significant performance improvements as image size and model complexity grow. Choosing larger model and image sizes does not degrade performance, and results improve as the dataset size increases.
  2. Simple structure: Vision Transformer uses one paradigm for object recognition, segmentation, detection, and more. It can replace complex CNN structures, and the overall structure is simple and easy to understand.
  3. Possibilities for self-supervised learning: Vision Transformer can be effectively combined with self-supervised learning algorithms. It performs well on image classification tasks with unlabeled data, using full-sequence attention between image patches.
  4. Performance of transfer learning: Pre-trained on the ImageNet benchmarks, Vision Transformer is highly adaptive not only to image classification tasks but also to various downstream tasks. This allows it to gain greater benefit from transfer learning.
  5. Interpretability and visualization: Vision Transformer's attention-based structure facilitates visual interpretation of the model's predictions and the extraction of insights from them. This allows you to visualize how the model reached its decision on an image, and to get some intuition as to why a prediction failed.
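The interpretability point can be sketched as follows: take the attention weights that the classification token assigns to each patch and lay them out on the patch grid as a heat map. This is a minimal sketch with made-up attention values for a single head in a single layer. Practical visualizations usually aggregate over heads and layers, and the function name is hypothetical.

```python
def cls_attention_map(attention_row, grid_width):
    """Turn the classification token's attention over patches into a 2D heat map.

    attention_row[0] is the token's attention to itself and is dropped;
    the remaining weights are renormalized so that they sum to 1.
    """
    patch_attention = attention_row[1:]
    total = sum(patch_attention)
    normalized = [a / total for a in patch_attention]
    return [
        normalized[i:i + grid_width]
        for i in range(0, len(normalized), grid_width)
    ]


# Hypothetical attention of the classification token over itself + 4 patches.
attention_row = [0.2, 0.1, 0.4, 0.1, 0.2]
heat_map = cls_attention_map(attention_row, grid_width=2)
```

In this toy example the top-right patch receives half of the patch attention, which suggests that region contributed most to the prediction.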

The future and value of Vision Transformer

Compared to traditional computer vision techniques, Vision Transformer has several advantages, such as performance, simplicity, self-supervised learning, transfer learning, and interpretability. Vision Transformer also has room to improve with further research, and is actively being commercialized in many areas.

This means that Vision Transformer will provide high-quality vision recognition performance and an easy-to-understand structure for users, enabling improved performance for various applications. These technological advances are expected to bring innovation in various fields and contribute to improving the quality of life. DATAHUNT is also interested in the changes and development of Vision Transformer and will continue research and development.
