<!DOCTYPE html>
<html lang="en-us">

  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">

  <title>
    
      The ICLR Blog Track &middot; 
    
  </title>

  
  <link rel="canonical" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/blog/">
  

  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/css/poole.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/css/syntax.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/css/lanyon.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/css/custom.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">

  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/favicon.ico">

  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/atom.xml">

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript" ></script>
 <!-- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> -->
  <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
         processEscapes: false
        }
      });
</script>
</head>


  <body>

    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- <input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox" > -->

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>For short-term, peer-sourced tests of time, generalizations, specializations, reproductions, etc.!</p>
  </div>

  <nav class="sidebar-nav">

    

    
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/">ICLR 2022 Blog Track</a>
        
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/about/">About</a>
        
      
    
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/submitting/">Submitting</a>
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/tags/">Tags</a>
        
      
    

    <a class="sidebar-nav-item" href="https://github.com/iclr-blog-track/iclr-blog-track.github.io">GitHub project</a>
    <span class="sidebar-nav-item">Currently vICLR Spring 2021</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2022. All rights reserved.
    </p>
  </div>
</div>


    <!-- Wrap is the content to shift when toggling the sidebar. We wrap the
         content to avoid any CSS collisions with our real content. -->
    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/" title="Home">The ICLR Blog Track</a>
            <small></small>
          </h3>
        </div>
      </div>

      <div class="container content">
        <div class="posts">
  
  <div >
    <h2 class="post-title">
      <a href="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/2021/09/01/Vision-Transformer/">
        Vision Transformer(ViT)
      </a>
    </h2>

    <span class="post-date">01 Sep 2021 | 
    </span>
    <span class="post-date"></span>

    <p>This blog post is based on the paper <a href="https://arxiv.org/abs/2010.11929">“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”</a> by Dosovitskiy et al., published at ICLR 2021. <br /> <br /></p>
<h2 id="background">Background</h2>
<p>Over the last 10 years, computer vision has advanced significantly, driven largely by convolutional neural networks (ConvNets). The development of convolutional neural networks, however, dates back to the 1980s, when the idea of connectionism, or parallel distributed processing (Rumelhart et al., 1986e; McClelland et al., 1995), emerged during the second wave of neural network research. Many of its ideas were revived from the work of the psychologist Donald Hebb (Hebb, 1949). The central idea of connectionism is that a large number of simple computing units can achieve intelligent behavior when networked together, whereas an individual unit or a small set of units is of little use. Several concepts that arose from connectionism in the 1980s remain central to today’s deep learning, such as distributed representations and the use of backpropagation to train deep neural networks.<a href="https://proceedings.neurips.cc/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf"><sup>1</sup></a><br /><br />
After transformers were introduced in NLP in 2017, replacing recurrent neural networks, many architectures built on self-attention (the transformer model) have made progress in areas such as time series forecasting<a href="https://arxiv.org/abs/2002.06103"><sup>4</sup></a>, graph-based models<a href="https://arxiv.org/abs/2106.03893"><sup>3</sup></a>, and visual recognition systems<a href="https://arxiv.org/abs/2010.11929"><sup>2</sup></a>. The figure below, taken from <a href="https://arxiv.org/pdf/2103.16775.pdf">Attention, Please! A Survey of Neural Attention Models in Deep Learning</a>, shows the development of attention in deep learning.
<br /></p>
<p align="center">
<img src="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/images/2021-09-01-Vision-Transformer/Attention_in_DL.png" alt="Attention_in_DL" />
<em>Figure 1: Timeline of work related to attention in deep learning.</em>
</p>
<p>Convolutional networks have dominated computer vision tasks. Following the success of transformer models in NLP, however, there has been research on combining self-attention with CNNs. The figure below shows work related to transformers in vision models.</p>
<p align="center">
<img src="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/images/2021-09-01-Vision-Transformer/related_work.png" alt="" />
<em>Figure 2: Related work on transformers in vision models.</em>
</p>
<h2 id="key-design-aspects-of-vit">Key Design Aspects of ViT</h2>
<p>The authors of ViT follow the original Transformer architecture (Vaswani et al., 2017) for image recognition.<br /></p>
<ul>
  <li>The input embedding is prepared by splitting the image into fixed-size patches (for example 16x16 or 14x14 pixels). For a 224x224 ImageNet image and 14x14 patches, this gives a grid of 224/14 x 224/14 = 16x16 patches, which is one reading of the paper's title: each patch is treated as a token, or word. These patches are then linearly projected, and a 1D positional embedding is added. The result is fed to the Transformer encoder.<br /></li>
  <li>Models with a smaller patch size are computationally more expensive, because the input sequence length grows as the patch size shrinks.<br /></li>
  <li>Why is positional encoding added? Self-attention by itself ignores the order of the tokens, so the positional encoding is what retains positional information. More details on 2D positional encoding are given in the paper.
<br /></li>
</ul>
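<p>The patch-splitting and embedding steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the random weights and the small hidden size of 64 are assumptions made for the example, and the extra learnable class token used in the paper is omitted.</p>

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    height, width, channels = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    grid_h, grid_w = height // patch_size, width // patch_size
    # (grid_h, P, grid_w, P, C) -> (grid_h, grid_w, P, P, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each row holding P * P * C pixel values.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * channels)

def embed_patches(image, patch_size, projection, position_embedding):
    """Linearly project the patches and add a 1D position embedding."""
    tokens = patchify(image, patch_size) @ projection   # (N, D)
    return tokens + position_embedding                   # (N, D)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
hidden_dim = 64  # illustrative only, much smaller than in the paper

for patch_size in (14, 16, 32):
    num_patches = (224 // patch_size) ** 2
    # Random stand-ins for the learned projection and position embedding.
    projection = rng.standard_normal((patch_size * patch_size * 3, hidden_dim)) * 0.02
    position_embedding = rng.standard_normal((num_patches, hidden_dim)) * 0.02
    tokens = embed_patches(image, patch_size, projection, position_embedding)
    print(patch_size, tokens.shape)
```

<p>For a 224x224 image the sequence length is 256 tokens with 14x14 patches, 196 with 16x16 patches and 49 with 32x32 patches, which shows why smaller patches make the model more expensive.</p>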
<p align="center">
<img src="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/images/2021-09-01-Vision-Transformer/Vit_architecture.png" alt="Vit architecture" />
<em>Figure 3: Model architecture.</em>
</p>
<p>As Figure 3 shows, each Transformer encoder block consists of a multi-head self-attention layer followed by an MLP layer. Layer normalization is applied before each of these layers: it keeps the scale of the activations consistent, which helps reduce the number of gradient descent steps needed to optimize the network.<br /></p>
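<p>A single encoder block with normalization applied before each layer might look roughly like the NumPy sketch below. It is a simplified illustration under stated assumptions: biases and the learned scale and shift of layer normalization are left out, the weights are random, the dimensions are arbitrary, and the parameter names (<code>w_qkv</code>, <code>w_out</code>, <code>w_fc1</code>, <code>w_fc2</code>) are made up for the example.</p>

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of the GELU non-linearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def multi_head_attention(x, w_qkv, w_out, num_heads):
    n, d = x.shape
    head_dim = d // num_heads
    qkv = (x @ w_qkv).reshape(n, 3, num_heads, head_dim)
    q, k, v = (qkv[:, i].transpose(1, 0, 2) for i in range(3))  # each (heads, N, head_dim)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)       # (heads, N, N)
    out = softmax(scores) @ v                                    # (heads, N, head_dim)
    return out.transpose(1, 0, 2).reshape(n, d) @ w_out

def encoder_block(x, params, num_heads):
    # Pre-norm: normalize, apply the sub-layer, then add the residual.
    x = x + multi_head_attention(layer_norm(x), params["w_qkv"], params["w_out"], num_heads)
    hidden = gelu(layer_norm(x) @ params["w_fc1"])
    return x + hidden @ params["w_fc2"]

rng = np.random.default_rng(0)
dim, heads, num_tokens = 64, 4, 196
params = {
    "w_qkv": rng.standard_normal((dim, 3 * dim)) * 0.02,
    "w_out": rng.standard_normal((dim, dim)) * 0.02,
    "w_fc1": rng.standard_normal((dim, 4 * dim)) * 0.02,
    "w_fc2": rng.standard_normal((4 * dim, dim)) * 0.02,
}
x = rng.standard_normal((num_tokens, dim))
y = encoder_block(x, params, heads)
print(y.shape)  # (196, 64)
```

<p>The block maps a sequence of tokens to a sequence of the same shape, so blocks can be stacked as many times as the model variant requires.</p>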
<h3 id="inductive-bias">Inductive bias</h3>
<p>ViT has significantly less image-specific inductive bias than a convolutional neural network. In ViT, only the MLP layers are local and translationally equivariant, whereas the self-attention layers are global. Positional encoding is one place where ViT does introduce an inductive bias. In the paper, pre-training is done at a resolution of 224x224 and fine-tuning at a higher resolution such as 384x384. Because the patch size stays the same, the number of patches increases and the sequence becomes longer, so the position embeddings learned for the pre-training patch grid no longer match the new grid. The pre-trained position embeddings are therefore 2D-interpolated, and this step introduces an inductive bias about the 2D structure of the image.</p>
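<p>The 2D interpolation of position embeddings can be illustrated as follows. This is only a sketch: the paper does not prescribe this exact code, and the choice of bilinear interpolation with aligned corners, the function name and the embedding size of 64 are assumptions. The example uses 16x16 patches, so the grid grows from 14x14 at 224x224 to 24x24 at 384x384, and it ignores the embedding of the class token.</p>

```python
import numpy as np

def interpolate_position_embeddings(pos_embed, old_grid, new_grid):
    """Bilinearly resize an (old_grid**2, D) embedding table to new_grid**2 rows."""
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(old_grid, old_grid, dim)
    # Map new grid coordinates onto the old grid, keeping the corners aligned.
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    low = np.floor(coords).astype(int)
    high = np.minimum(low + 1, old_grid - 1)
    frac = coords - low
    # Interpolate along rows first, then along columns.
    rows = grid[low] * (1 - frac)[:, None, None] + grid[high] * frac[:, None, None]
    out = rows[:, low] * (1 - frac)[None, :, None] + rows[:, high] * frac[None, :, None]
    return out.reshape(new_grid * new_grid, dim)

patch_size = 16
old_grid, new_grid = 224 // patch_size, 384 // patch_size   # 14 and 24
rng = np.random.default_rng(0)
pos_embed = rng.standard_normal((old_grid * old_grid, 64))  # stand-in for learned embeddings
resized = interpolate_position_embeddings(pos_embed, old_grid, new_grid)
print(pos_embed.shape, resized.shape)  # (196, 64) (576, 64)
```

<p>Each new embedding is a weighted mix of its spatial neighbours in the old grid, which is exactly where the assumption that patches lie on a 2D grid enters the model.</p>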
<h3 id="model-variants-and-results">Model variants and results</h3>
<p>There are three variants of ViT: ViT-Base, ViT-Large and ViT-Huge. ViT-Large uses 16x16 patches and ViT-Huge uses 14x14 patches.<br />
The baseline is a modified ResNet (BiT), in which the Batch Normalization layers are replaced with Group Normalization and standardized convolutions are used.<br /></p>
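<p>To give a feel for how the three variants differ in size, the sketch below estimates the number of encoder weights from the layer count and layer widths. The configuration values are not stated in this post; they are the ones listed in Table 1 of the ViT paper and should be checked against it. The class name and the estimate, which ignores biases, normalization parameters and embeddings, are assumptions made for illustration.</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViTConfig:
    name: str
    layers: int
    hidden_dim: int
    mlp_dim: int
    heads: int

    def approx_encoder_params(self):
        """Rough weight count of the encoder, ignoring biases, norms and embeddings."""
        attention = 4 * self.hidden_dim * self.hidden_dim   # Q, K, V and output projections
        mlp = 2 * self.hidden_dim * self.mlp_dim            # two dense layers
        return self.layers * (attention + mlp)

# Values as listed in Table 1 of the ViT paper (not in this post).
VARIANTS = [
    ViTConfig("ViT-Base", layers=12, hidden_dim=768, mlp_dim=3072, heads=12),
    ViTConfig("ViT-Large", layers=24, hidden_dim=1024, mlp_dim=4096, heads=16),
    ViTConfig("ViT-Huge", layers=32, hidden_dim=1280, mlp_dim=5120, heads=16),
]

for cfg in VARIANTS:
    print(f"{cfg.name}: ~{cfg.approx_encoder_params() / 1e6:.0f}M encoder weights")
```

<p>The estimates come out at roughly 85M, 302M and 629M weights, close to the 86M, 307M and 632M parameters the paper reports for the full models.</p>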
<p></p>
<h3 id="comparison-with-state-of-art">Comparison with State of the Art</h3>
<p align="center">
<img src="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/images/2021-09-01-Vision-Transformer/Performance.png" alt="Performance comparison with state of the art" />
</p>
<p>The figure above compares the ViT variants with state-of-the-art models. The ViT models shown here were pre-trained on the JFT-300M dataset; they match or outperform the state-of-the-art models while requiring fewer computational resources for pre-training.</p>
<p align="center">
<img src="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/images/2021-09-01-Vision-Transformer/VTAB.png" alt="VTAB benchmark results" />
</p>
<p>The VTAB benchmark, which uses 1,000 training examples per task, breaks the results down into three task groups: Natural (tasks such as CIFAR), Specialized (medical and satellite images) and Structured (tasks that require geometric understanding). ViT outperforms the previous SOTA.<br /></p>
<p></p>
<h3 id="training-data-size-matter">Training data size matters</h3>
<p align="center">
<img src="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/images/2021-09-01-Vision-Transformer/result1.png" />
</p>
<p>The Vision Transformer is pre-trained on datasets of increasing size: ImageNet, ImageNet-21k and JFT-300M. As the figure above shows, when all model variants are pre-trained on ImageNet, ViT-Large underperforms. With ImageNet-21k, the performance of ViT-Large is similar to that of BiT (ResNet). With JFT-300M, however, ViT-Large performs better. In short, the BiT CNNs outperform ViT on ImageNet, but ViT overtakes them on the larger datasets. The authors also ran an experiment in which the models were trained on random subsets of 9M, 30M and 90M images, and finally on the full JFT-300M.<br /></p>
<h3 id="how-does-vision-transformer-process-the-image-data">How does Vision Transformer process the image data?</h3>
<p align="center">
<img src="https://iclr.iro.umontreal.ca/31b016ca-c229-4822-97e2-7300c3798230_1642191585/public/images/2021-09-01-Vision-Transformer/self_attention.png" />
</p>
<p>The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space. Figure 4 (left) shows the top principal components of the learned embedding filters.
Figure 4 (centre) shows that the model learns to encode distance within the image in the similarity of position embeddings: closer patches tend to have more similar position embeddings.<br />
Self-attention allows ViT to integrate information across the entire image even in the lowest layers. Figure 4 (right) shows the “attention distance”, which is analogous to the receptive field size in CNNs. The figure is for the ViT-L/32 variant, which has 24 layers and 16 attention heads per layer. In the first layer (layer 0), some attention heads already attend globally while others attend locally.</p>
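<p>One way to compute an attention distance of this kind is to average the pixel distance between each query patch and all key patches, weighted by the attention weights. The sketch below follows that idea; the function name, the use of patch-grid offsets as positions and the two synthetic attention heads are assumptions for illustration, and the class token is ignored.</p>

```python
import numpy as np

def mean_attention_distance(attention, grid_size, patch_size):
    """Average pixel distance between query and key patches, weighted by attention.

    attention: (heads, N, N) array whose rows sum to 1, with N = grid_size ** 2.
    Returns one value per head.
    """
    rows, cols = np.divmod(np.arange(grid_size * grid_size), grid_size)
    positions = np.stack([rows, cols], axis=-1) * patch_size          # (N, 2) in pixels
    diff = positions[:, None, :] - positions[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))                     # (N, N)
    # Weight distances by attention, sum over keys, average over queries.
    return (attention * distances[None]).sum(axis=-1).mean(axis=-1)

grid_size, patch_size = 7, 32   # a 224x224 image split into 32x32 patches
n = grid_size ** 2
global_head = np.full((n, n), 1.0 / n)   # attends to every patch equally
local_head = np.eye(n)                    # attends only to its own patch
attention = np.stack([global_head, local_head])
print(mean_attention_distance(attention, grid_size, patch_size))
```

<p>The head with uniform attention gets a large distance and the head that only looks at its own patch gets a distance of zero, mirroring the global and local heads seen in the lowest layer.</p>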
<h2 id="new-intuitions">New intuitions</h2>
<ul>
  <li>Transformers can be used in face recognition and fine-grained classification. <br /></li>
  <li>The authors mention in the paper that ‘Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts’, so in the future ViT might become for computer vision tasks what GPT-3 is for language.</li>
  <li>Recent research on <a href="https://arxiv.org/pdf/2112.13492.pdf">Vision Transformer for small size datasets</a> improves performance on Tiny-ImageNet. The authors of that paper propose Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which are designed to address the lack of locality inductive bias.<br /></li>
</ul>
<p></p>
<h2 id="reference">Reference</h2>
<ul>
  <li>Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” ICLR (2021).<br /></li>
  <li>Correia, Alana de Santana, and Esther Luna Colombini. “Attention, please! A survey of Neural Attention Models in Deep Learning.” arXiv preprint arXiv:2103.16775 (2021).<br /></li>
  <li><a href="https://www.youtube.com/watch?v=j6kuz_NqkG0&amp;t=5s">Vision Transformer (ViT) - An image is worth 16x16 words(Paper Explained)</a></li>
</ul>

    <hr>
  </div>
  
</div>

<div class="pagination">
  
  <span class="pagination-item older">Older</span>
  
  
  <span class="pagination-item newer">Newer</span>
  
</div>

      </div>
    </div>

    <label for="sidebar-checkbox" class="sidebar-toggle"></label>

    <script src='/public/js/script.js'></script>
  </body>
</html>
