<!DOCTYPE html>
<html lang="en-us">

  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">

  <title>
    
      BERT vs ALBERT explained &middot; The ICLR Blog Track
    
  </title>

  
  <link rel="canonical" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/2021/12/01/albert/">
  

  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/public/css/poole.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/public/css/syntax.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/public/css/lanyon.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/public/css/custom.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">

  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/public/favicon.ico">

  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/atom.xml">

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript" ></script>
 <!-- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> -->
  <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
         processEscapes: false
        }
      });
</script>
</head>


  <body>

    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>For short-term, peer-sourced tests of time, generalizations, specializations, reproductions, etc.!</p>
  </div>

  <nav class="sidebar-nav">
    <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/">Home</a>

    

    
    
      
        
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/about/">About</a>
        
      
    
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/submitting/">Submitting</a>
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/fe213953-21c6-4bda-8eb6-9d0e3543ae2d_1641910368/tags/">Tags</a>
        
      
    

    <a class="sidebar-nav-item" href="https://github.com/KandiSanjana/KandiSanjana.github.io">GitHub project</a>
    <span class="sidebar-nav-item">Currently vICLR Spring 2021</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2022. All rights reserved.
    </p>
  </div>
</div>


    <!-- Wrap is the content to shift when toggling the sidebar. We wrap the
         content to avoid any CSS collisions with our real content. -->
    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/" title="Home">The ICLR Blog Track</a>
            <small></small>
          </h3>
        </div>
      </div>

      <div class="container content">
        <div class="post">
  <h1 id="iclr-post-title" class="post-title">BERT vs ALBERT explained</h1>
  <span class="post-date">01 Dec 2021 | 
    <a class="content-tag" href="/tags/#2021-12-01-albert"> 2021-12-01-albert </a>
  
    <a class="content-tag" href="/tags/#nlp"> NLP </a>
  
    <a class="content-tag" href="/tags/#machine-learning"> Machine Learning </a>
  
    <a class="content-tag" href="/tags/#scale"> Scale </a>
  
    <a class="content-tag" href="/tags/#bert"> BERT </a>
  
    <a class="content-tag" href="/tags/#albert"> ALBERT </a>
  </span>

  <span id="iclr-post-authors" class="post-date">Ramu, Sahana, Carnegie Mellon University; Kandi, Sanjana, Carnegie Mellon University</span>

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I1.png" alt="ALT" />
  
</div>

<h2 id="introduction">Introduction</h2>

<p>Implementing Machine Learning and Deep Learning models at scale requires an immense amount of training time and computational resources. In language representation learning in particular, studies have shown that pre-training large full networks is crucial for achieving state-of-the-art performance. However, increasing the model size increases the number of parameters, which significantly raises the training time and compute requirements. This is a major challenge in large-scale computing. In this blog, we provide a brief summary of the ICLR paper “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.” The paper presents two parameter reduction techniques that lower memory consumption and increase the training speed of the BERT (Bidirectional Encoder Representations from Transformers) architecture. The resulting models scale much better than the original BERT.</p>

<h2 id="what-is-bert">What is BERT?</h2>
<p>We all know Google’s BERT has changed the NLP landscape, but what is it exactly?
BERT is one of the most widely used natural language processing (NLP) models. It helps computers understand the meaning of text by using the surrounding text as context. BERT, which stands for ‘<strong>B</strong>idirectional <strong>E</strong>ncoder <strong>R</strong>epresentations from <strong>T</strong>ransformers’, is built on the Transformer architecture, in which every output element is connected to every input element and the weights between them are calculated dynamically from the input. In NLP, this mechanism is commonly known as ‘Attention’.</p>
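<p>To make "dynamically calculated weights" concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration only: a real BERT layer also applies learned query, key and value projections and uses several attention heads, both of which are left out here, and the toy vectors are made up.</p>

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # One score per input position: how relevant is it to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d token vectors (made up)
contextual = attend(tokens, tokens, tokens)
print(contextual)
```

<p>Every output vector mixes information from all input positions, and the mixing weights depend on the inputs themselves rather than being fixed.</p>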

<h2 id="now-what-is-albert">Now… what is ALBERT?</h2>
<p>BERT handles tasks ranging from simple text classification to complex ones like question answering. While it may seem like the perfect language model, this state-of-the-art architecture has hundreds of millions, if not billions, of parameters. That hampers training speed as we scale these models, since communication overhead in distributed training is directly proportional to the number of parameters. <strong>A</strong> <strong>L</strong>ite <strong>BERT</strong> (ALBERT) addresses these issues: its architecture is similar to that of BERT, but it has far fewer parameters.
So, how exactly does ALBERT achieve this?
ALBERT incorporates two parameter reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. In addition, it introduces a self-supervised loss for sentence-order prediction.</p>

<p>Wondering what these mean? Let’s now dive into some details!</p>

<p>First, let’s look at the ALBERT model architecture. 
It is similar to that of BERT, that is, it uses a transformer encoder with GELU non-linearities.</p>

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I2.png" alt="ALT" />
  
</div>

<div align="center">
  <table>
    <thead>
      <tr>
        <th>Hyperparameter</th>
        <th>Symbol or value</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Embedding Size</td>
        <td>E</td>
      </tr>
      <tr>
        <td>Number of Encoder Layers</td>
        <td>L</td>
      </tr>
      <tr>
        <td>Hidden size</td>
        <td>H</td>
      </tr>
      <tr>
        <td>Feed forward/filter size</td>
        <td>4H</td>
      </tr>
      <tr>
        <td>Number of attention heads</td>
        <td>H/64</td>
      </tr>
    </tbody>
  </table>
</div>
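<p>The last two rows of the table are derived from H rather than chosen independently. A small sketch of that derivation follows; the function name and the list of hidden sizes are illustrative assumptions, not part of any ALBERT codebase.</p>

```python
def derive_sizes(hidden_size):
    """Derive feed-forward size and head count from H, per the table above."""
    if hidden_size % 64 != 0:
        raise ValueError("hidden_size must be a multiple of 64")
    return {
        "hidden_size": hidden_size,
        "feed_forward_size": 4 * hidden_size,   # 4H
        "attention_heads": hidden_size // 64,   # H/64
    }

# Illustrative hidden sizes (assumed for this example).
for h in (768, 1024, 2048, 4096):
    print(derive_sizes(h))
```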

<p>Let us now look at how these parameter reduction techniques actually work.</p>

<h3 id="1-factorized-embedding-parameterization">1. Factorized embedding parameterization</h3>
<p>In BERT, the WordPiece embedding size E is tied to the hidden layer size H, i.e., E = H. This is suboptimal for two reasons:</p>
<ul>
  <li>NLP tasks usually require a very large vocabulary size, denoted by V. If the embedding size E is tied to the hidden size H, then increasing H also increases the size of the embedding matrix, which is V × E. This can easily push the number of model parameters into the billions, circling back to our primary problem.</li>
  <li>WordPiece embeddings are meant to learn context-independent representations, whereas hidden-layer embeddings are meant to learn context-dependent representations.
The power of BERT comes mainly from its context-dependent representations, so modeling needs call for a hidden size H much greater than the embedding size E. If H and E are tied together, increasing H forces E to grow as well, inflating the total number of model parameters without a matching benefit.</li>
</ul>

<p>To combat this, ALBERT decomposes the embedding parameters into two smaller matrices. Instead of projecting the one-hot encoded vectors directly into the hidden space of size H, it first projects them into a lower-dimensional embedding space of size E and then projects that into the hidden space. The embedding parameters therefore go from O(V × H) to O(V × E + E × H).
This reduction is significant when H ≫ E.</p>
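<p>The saving is easy to check with a little arithmetic. The sketch below counts embedding parameters with and without factorization; the sizes V = 30,000, H = 768 and E = 128 are assumed for illustration, and the function name is hypothetical.</p>

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters; factorized when embedding_size is given."""
    if embedding_size is None:
        return vocab_size * hidden_size  # BERT-style: one V x H matrix
    # ALBERT-style: a V x E lookup table followed by an E x H projection
    return vocab_size * embedding_size + embedding_size * hidden_size

V, H, E = 30_000, 768, 128  # illustrative sizes (assumed)
tied = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"tied: {tied:,}  factorized: {factorized:,}")
```

<p>With these assumed sizes, the embedding block shrinks from about 23.0M to about 3.9M parameters.</p>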

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I4.png" alt="ALT" />
  
</div>

<p>The table above shows how ALBERT-based models perform as the embedding size E varies. With non-shared parameters (BERT-style), larger embedding sizes perform better, but not by a significant margin. With all-shared parameters (ALBERT-style), the models give up roughly 1% accuracy while using about 70-80M fewer parameters, which is a substantial saving over BERT. Among the embedding sizes tried in this setting, E = 128 appears to perform best.</p>

<h3 id="2-cross-layer-parameter-sharing">2. Cross-layer parameter sharing</h3>
<p>The main purpose of parameter sharing is a radical reduction in the number of parameters in the network. Accuracy drops slightly with this method, but the parameter count shrinks substantially, and sharing can also act as a form of regularization that helps generalization. While there are several ways to share parameters (for example, only the feed-forward weights or only the attention weights), the default in ALBERT is to share all parameters across layers. The effect of sharing can be seen by comparing the L2 distances and cosine similarities between the input and output embeddings of each layer for BERT and ALBERT, as shown below.</p>

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I3.png" alt="ALT" />
  
</div>

<p>As we can see in the figure above, the transitions from layer to layer are much smoother for ALBERT than BERT. Hence, apart from just parameter reduction, parameter sharing across layers also stabilizes the parameters.</p>
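<p>For readers who want to reproduce this kind of plot, here is a rough sketch of the two measurements for a single token. It assumes you already have the token's hidden state before the first layer and after every layer; the example states are made up, and reporting the cosine similarity as an angle in degrees is our reading of the figure rather than a documented recipe.</p>

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angle_degrees(u, v):
    """Angle between two vectors in degrees, from their cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.degrees(math.acos(cosine))

def layer_transitions(states):
    """Compare each layer's input with its output for one token.

    states[0] is the embedding fed to layer 1, states[i] the output of layer i.
    """
    return [
        (l2_distance(x_in, x_out), angle_degrees(x_in, x_out))
        for x_in, x_out in zip(states, states[1:])
    ]

# Hypothetical hidden states of one token across three layers.
states = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [0.0, 2.5]]
for layer, (dist, angle) in enumerate(layer_transitions(states), start=1):
    print(f"layer {layer}: L2={dist:.2f}, angle={angle:.1f} degrees")
```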

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I5.png" alt="ALT" />
  
</div>

<p>The table above compares ALBERT-based models under different parameter-sharing configurations, for embedding sizes E = 128 and E = 768. The not-shared (BERT-style) strategy performs best, at the cost of a much larger number of parameters. The all-shared (ALBERT-style) strategy hurts performance under both embedding sizes, but the drop is small relative to the parameters saved. The all-shared strategy is therefore used as the default choice.</p>
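<p>A back-of-the-envelope count shows why all-shared is so much smaller. The sketch below counts the weights of a standard Transformer encoder layer (attention projections, a feed-forward block of width 4H, and two layer norms) and compares 12 independent layers against one shared layer. The formula is the usual approximation for such a layer, not taken from the paper's tables, and H = 768 with 12 layers is an assumed base-sized configuration.</p>

```python
def encoder_layer_params(hidden_size):
    """Approximate parameter count of one Transformer encoder layer."""
    h = hidden_size
    attention = 4 * (h * h + h)  # query, key, value and output projections
    feed_forward = (h * 4 * h + 4 * h) + (4 * h * h + h)  # two dense layers, 4H wide
    layer_norms = 2 * (2 * h)  # scale and bias for two layer norms
    return attention + feed_forward + layer_norms

def encoder_params(hidden_size, num_layers, share_across_layers):
    """Total encoder parameters with or without cross-layer sharing."""
    per_layer = encoder_layer_params(hidden_size)
    return per_layer if share_across_layers else per_layer * num_layers

H, L = 768, 12  # illustrative base-sized configuration (assumed)
not_shared = encoder_params(H, L, share_across_layers=False)
all_shared = encoder_params(H, L, share_across_layers=True)
print(f"not-shared: {not_shared:,}  all-shared: {all_shared:,}")
```

<p>Under these assumptions the encoder stack drops from roughly 85M to roughly 7M parameters, because the cost of the stack no longer grows with the number of layers.</p>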

<h3 id="3-inter-sentence-coherence-loss">3. Inter-sentence coherence loss</h3>
<p>BERT is trained with two losses: the Masked Language Modelling (MLM) loss and the Next Sentence Prediction (NSP) loss. NSP is a binary classification task that predicts whether two segments appear consecutively in the original text. Its benefit was found to be unreliable, which the ALBERT authors attribute to the task being too easy. ALBERT therefore replaces it with a sentence-order prediction (SOP) loss that focuses on inter-sentence coherence. Positive examples are two consecutive segments from the same document; negative examples are the same two segments with their order swapped. This forces the model to learn finer-grained distinctions about discourse-level coherence, and as a result ALBERT performs better on multi-sentence encoding tasks.</p>
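<p>The construction of SOP training pairs can be sketched in a few lines. This is a simplified illustration: the function name and the example sentences are made up, and it emits both the ordered and the swapped version of every pair, whereas an actual pre-training pipeline would typically sample one of the two at random.</p>

```python
def make_sop_examples(segments):
    """Build sentence-order prediction pairs from consecutive segments.

    Returns (first, second, label) tuples: label 1 for the original order,
    label 0 for the same two segments swapped.
    """
    examples = []
    for first, second in zip(segments, segments[1:]):
        examples.append((first, second, 1))  # positive: correct order
        examples.append((second, first, 0))  # negative: order swapped
    return examples

# Made-up document split into consecutive segments.
document = [
    "ALBERT shares parameters across layers.",
    "This keeps the model small.",
    "It also uses a sentence-order loss.",
]
for example in make_sop_examples(document):
    print(example)
```

<p>Because both segments of a negative pair come from the same document, topic cues alone cannot separate the classes; the model has to look at the order.</p>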

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I6.png" alt="ALT" />
  
</div>

<p>This table compares three settings for the additional inter-sentence loss: none (XLNet- and RoBERTa-style), NSP (BERT-style), and SOP (ALBERT-style). The comparison covers both intrinsic and downstream tasks. A model trained with the SOP loss solves the NSP task reasonably well and performs much better on the SOP task. Downstream performance on multi-sentence encoding tasks is also better with the SOP loss, improving by about 1% on average.</p>

<h2 id="how-do-these-two-compare">How do these two compare?</h2>

<h3 id="1-comparison-with-number-of-parameters">1. Comparison with number of parameters</h3>

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I7.png" alt="ALT" />
  
</div>

<p>Now that we have covered the parameter reduction methods, let us compare BERT and ALBERT by looking at some numbers.
For example, ALBERT-large has about 18x fewer parameters than BERT-large: 18M versus 334M!
We can also look at it from the perspective of hidden layer size. An ALBERT-xlarge configuration with H = 2048 has only 60M parameters, and an ALBERT-xxlarge configuration with H = 4096 has 233M parameters, i.e., around 70% of the parameters of BERT-large.</p>
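<p>The ratios quoted above follow directly from the parameter counts. A quick check, using only the numbers given in the text (the dictionary and helper names are ours):</p>

```python
# Parameter counts quoted in the text, in millions.
params_m = {
    "BERT-large": 334,
    "ALBERT-large": 18,
    "ALBERT-xlarge": 60,
    "ALBERT-xxlarge": 233,
}

def relative_size(model, baseline="BERT-large"):
    """Size of `model` as a fraction of the baseline's parameter count."""
    return params_m[model] / params_m[baseline]

for name in params_m:
    print(f"{name}: {relative_size(name):.0%} of BERT-large")
print(f"BERT-large / ALBERT-large = {1 / relative_size('ALBERT-large'):.1f}x")
```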

<p>The comparison above shows that ALBERT is far more parameter-efficient than BERT. But a smaller parameter count alone says nothing about quality, so as Machine Learning enthusiasts we should also compare the two on popular benchmarks such as GLUE, SQuAD and RACE.</p>

<h3 id="2-comparison-with-benchmarks">2. Comparison with benchmarks</h3>

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I8.png" alt="ALT" />
  
</div>

<p>With only about 70% of the parameters of BERT-large, ALBERT-xxlarge achieves significant improvements over it. The largest gain is on RACE (+8.4%).</p>

<h3 id="3-comparison-with-training-time">3. Comparison with training time</h3>

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I9.png" alt="ALT" />
  
</div>

<p>The table compares models under a similar training-time budget rather than a similar number of steps. Longer training generally leads to better performance, so the fair comparison is to hold training time roughly constant and compare the results. After about the same amount of time, ALBERT-xxlarge trained for 125k steps (32 hours) outperforms BERT-large trained for 400k steps (34 hours). Here again, the largest improvement is on RACE (+5.2%).</p>
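<p>The step counts and hours above also tell us how expensive each ALBERT-xxlarge step is. A small calculation, using only those quoted figures (the helper name is ours):</p>

```python
def steps_per_hour(steps, hours):
    """Average number of training steps completed per hour."""
    return steps / hours

bert_large = steps_per_hour(400_000, 34)      # figures quoted above
albert_xxlarge = steps_per_hour(125_000, 32)  # figures quoted above
slowdown = bert_large / albert_xxlarge

print(f"BERT-large:     {bert_large:,.0f} steps/hour")
print(f"ALBERT-xxlarge: {albert_xxlarge:,.0f} steps/hour")
print(f"ALBERT-xxlarge is about {slowdown:.1f}x slower per step")
```

<p>So ALBERT-xxlarge processes data roughly three times more slowly than BERT-large, yet still comes out ahead within the same wall-clock budget.</p>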

<p>The authors then decided to get their hands dirty and try out a few add-ons to improve the model! Let’s see what these are.</p>

<h2 id="additional-training-data-and-dropout-effects">Additional training data and dropout effects</h2>

<div align="center">
  
  <img src="/public/images/2021-12-01-albert/I10.png" alt="ALT" />
  
</div>

<p>Up to this point, the experiments used only two datasets, Wikipedia and BOOKCORPUS. The figure above shows what happens when we add the additional data used by both XLNet and RoBERTa: the extra data gives a significant boost to the dev set MLM accuracy.
What is surprising is that even after training for 1M steps, the largest models still do not overfit their training data. The authors therefore remove dropout to further increase model capacity, which yields higher MLM accuracy, as the figure above also shows. Combining batch normalization and dropout is common practice in CNNs, but there is evidence that this combination can actually be harmful.</p>

<h2 id="conclusion">Conclusion</h2>

<p>ALBERT succeeds in sharply reducing the number of parameters while still learning powerful contextual representations, and ALBERT-xxlarge achieves significantly better results than BERT-large. However, because of its larger structure, ALBERT-xxlarge is computationally more expensive than BERT-large. Methods such as sparse attention and block attention are promising ways to speed up its training and inference.</p>

<p>That’s it folks! Hope this was a good and informative read.</p>

<h2 id="bibliography">Bibliography</h2>
<p>Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., &amp; Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.</p>

<p>Devlin, J., Chang, M. W., Lee, K., &amp; Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.</p>

<p>Medium. (2019, September 27). <em>Google’s ALBERT Is a Leaner BERT; Achieves SOTA on 3 NLP Benchmarks</em>. https://medium.com/syncedreview/googles-albert-is-a-leaner-bert-achieves-sota-on-3-nlp-benchmarks-f64466dd583</p>

<p>Machinecurve. (2021, January 6). <em>ALBERT explained: A Lite BERT</em>. https://www.machinecurve.com/index.php/2021/01/06/albert-explained-a-lite-bert/</p>

</div>

<div id="bibtex-container" class="related">
  For attribution in academic contexts, please cite this work as
  <pre id="bibtex-academic-attribution">

  </pre>

  BibTeX citation
  <pre id="bibtex-box">

  </pre>
</div>
<script>
  let authorsSpan = document.getElementById("iclr-post-authors");
  let authorsText = authorsSpan.textContent;
  let lnameFnameInstitution = authorsText.split(";");
  let lfiList = lnameFnameInstitution.map(lfi => lfi.split(",").map(item => item.trim()));
  let bibtexLFI = lfiList.map(lfi => lfi[0] + ", " + lfi[1]).join(" and ")
  let academicLFI = lfiList.map(lfi => lfi[0]);
  {
    if(academicLFI.length > 2) academicLFI = academicLFI[0] + ", et al.";
    else if(academicLFI.length == 2) academicLFI = academicLFI[0] + " & " + academicLFI[1];
    else academicLFI = academicLFI[0];
  }

  let titleSpan = document.getElementById("iclr-post-title");
  let titleText = titleSpan.textContent.trim();
  let bibtexTitleShorthand = (lfiList[0][1]+
    "2022"+
    titleText.split(" ").slice(0, 3).join("")
  ).replace(" ", "").replace(/[\p{P}$+<=>^`|~]/gu, '').toLowerCase().trim();

  let bibtexTemplate = `
@inproceedings{${bibtexTitleShorthand},
  author = {${bibtexLFI}},
  title = {${titleText}},
  booktitle = {ICLR Blog Track},
  year = {2022},
  note = {${window.location.href}},
  url  = {${window.location.href}}
}
  `.trim();
  document.getElementById("bibtex-box").innerText = bibtexTemplate;

  let academicTemplate = `
${academicLFI}, "${titleText}", ICLR Blog Track, 2022.
`.trim();
  document.getElementById("bibtex-academic-attribution").innerText = academicTemplate;

</script>




<script src="https://utteranc.es/client.js"
        repo="iclr-blog-track/iclr-blog-track.github.io"
        issue-term="pathname"
        label="utterance"
        theme="boxy-light"
        crossorigin="anonymous"
        >
</script>


      </div>
    </div>

    <label for="sidebar-checkbox" class="sidebar-toggle"></label>

    <script src='/public/js/script.js'></script>
  </body>
</html>
