<!DOCTYPE html>
<html lang="en-us">

  <head>
  <link href="http://gmpg.org/xfn/11" rel="profile">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">

  <title>
    
      When do Curricula Work? (Wu, Dyer, and Neyshabur, 2021) &middot; The ICLR Blog Track
    
  </title>

  
  <link rel="canonical" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/2021/12/01/when-do-curricula-work/">
  

  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/public/css/poole.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/public/css/syntax.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/public/css/lanyon.css">
  <link rel="stylesheet" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/public/css/custom.css">
  <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">

  <link rel="apple-touch-icon-precomposed" sizes="144x144" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/public/apple-touch-icon-precomposed.png">
  <link rel="shortcut icon" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/public/favicon.ico">

  <link rel="alternate" type="application/rss+xml" title="RSS" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/atom.xml">

  

  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript" ></script>
 <!-- <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); </script> -->
  <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
         processEscapes: false
        }
      });
</script>
</head>


  <body>

    <!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
     styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- <input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox" > -->

<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <p>For short-term, peer-sourced tests of time, generalizations, specializations, reproductions, etc.!</p>
  </div>

  <nav class="sidebar-nav">

    

    
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/">ICLR 2022 Blog Track</a>
        
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/about/">About</a>
        
      
    
      
    
      
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/submitting/">Submitting</a>
        
      
    
      
        
          <a class="sidebar-nav-item" href="https://iclr.iro.umontreal.ca/36ba5402-7a2f-4ee4-bcf9-4c8ae9351da9_1642243108/tags/">Tags</a>
        
      
    

    <a class="sidebar-nav-item" href="https://github.com/iclr-blog-track/iclr-blog-track.github.io">GitHub project</a>
    <span class="sidebar-nav-item">Currently vICLR Spring 2021</span>
  </nav>

  <div class="sidebar-item">
    <p>
      &copy; 2022. All rights reserved.
    </p>
  </div>
</div>


    <!-- Wrap is the content to shift when toggling the sidebar. We wrap the
         content to avoid any CSS collisions with our real content. -->
    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="/" title="Home">The ICLR Blog Track</a>
            <small></small>
          </h3>
        </div>
      </div>

      <div class="container content">
        <div class="post">
  <h1 id="iclr-post-title" class="post-title">When do Curricula Work? (Wu, Dyer, and Neyshabur, 2021)</h1>
  <span class="post-date">01 Dec 2021 | 
    <a class="content-tag" href="/tags/#machine-learning"> machine learning </a>
  
    <a class="content-tag" href="/tags/#deep-learning"> deep learning </a>
  
    <a class="content-tag" href="/tags/#curriculum-learning"> curriculum learning </a>
  </span>

  <span id="iclr-post-authors" class="post-date">Anonymous</span>
  <p>This post is a summary of <a href="https://openreview.net/forum?id=tW4QEInpni">When Do Curricula Work?</a> (Wu, Dyer, and Neyshabur, 2021), a paper accepted to ICLR 2021 for an oral presentation.</p>

<h2 id="summary">Summary</h2>

<p>By default, training examples are presented to a neural network in random order. Curriculum learning and anti-curriculum learning instead order the examples by difficulty: curriculum learning presents easier examples first, whereas anti-curriculum learning presents harder examples first. This paper performs an empirical study of these ordered learning techniques on an image classification task and concludes that:</p>
<ul>
  <li>No curricula benefit final performance in the standard setting, but</li>
  <li>Curriculum learning can help if training time is limited or the dataset is noisy</li>
</ul>

<p>This paper may be interesting to you if you:</p>
<ul>
  <li>want to know if curriculum learning will benefit your model, or</li>
  <li>need to choose a scoring function and a pacing function to define your curricula.</li>
</ul>

<p align="center">
  <img src="/public/images/2021-12-01-when-do-curricula-work/cl_vs_acl.png" alt="Curriculum Learning vs Anti-curriculum Learning" />
</p>

<h2 id="defining-a-curriculum">Defining a Curriculum</h2>

<p>Although the idea behind curriculum learning and anti-curriculum learning is simple, there are many choices that could result in a different curriculum. We can define a curriculum through 3 components:</p>
<ul>
  <li>The scoring function $s(x)$, which scores the example $x$</li>
  <li>The pacing function $g(t)$, which determines the size of the dataset at step $t$</li>
  <li>The order</li>
</ul>

<p>Before training, each example in the dataset is given a score through the scoring function.
During training, at each step $t$, the pacing function determines the size of the dataset.
Depending on the order (“curriculum” or “anti-curriculum”), the dataset for step $t$ consists of the $g(t)$ examples with the lowest or the highest scores, respectively.
We also allow “random” ordering to serve as a baseline.
Note that a curriculum with random ordering is still paired with a pacing function, so its dataset size still varies over the course of training.</p>
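<p>The three components above can be sketched in a few lines. This is only an illustration, not the authors' code: the names <code>make_ranking</code> and <code>subset_at_step</code> are hypothetical, and the sketch assumes that a lower score means an easier example, so that the curriculum order takes the lowest-scored examples first.</p>

```python
import random


def make_ranking(scores, order, seed=0):
    """Return example indices in the order they become available for training."""
    indices = list(range(len(scores)))
    if order == "curriculum":
        # Assumption: lower score = easier example, so easy examples come first.
        return sorted(indices, key=lambda i: scores[i])
    if order == "anti-curriculum":
        return sorted(indices, key=lambda i: scores[i], reverse=True)
    if order == "random":
        # A fixed shuffle: the subset still grows with the pacing function.
        random.Random(seed).shuffle(indices)
        return indices
    raise ValueError(f"unknown order: {order}")


def subset_at_step(ranking, size):
    """Return the first `size` indices, where `size` plays the role of g(t)."""
    size = max(1, min(int(size), len(ranking)))
    return ranking[:size]


scores = [0.9, 0.1, 0.5, 0.3]  # hypothetical difficulty scores s(x)
easy_first = make_ranking(scores, "curriculum")
hard_first = make_ranking(scores, "anti-curriculum")
print(subset_at_step(easy_first, 2))  # [1, 3]
print(subset_at_step(hard_first, 2))  # [0, 2]
```

<p>Computing the ranking once and slicing it at each step keeps the subsets nested: every example available at step $t$ stays available at later steps.</p>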

<p>For the scoring function, the paper chooses the c-score by <a href="https://proceedings.mlr.press/v139/jiang21k.html">Jiang et al., 2021</a>, which quantifies how well the model could predict the example’s label when trained on a dataset without that example. Other ways to score an example are to use its loss or the index of the epoch where the model first predicted the example correctly. However, experiments show that these 3 scoring functions are highly correlated on both VGG-11 and ResNet-18, so only the c-score is used.</p>
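<p>As a rough illustration of the epoch-based alternative mentioned above, the sketch below scores each example by the first epoch at which it was predicted correctly. The function name, the input layout, and the choice to give never-correct examples the largest score are assumptions made for this example, not details taken from the paper.</p>

```python
def first_correct_epoch(correct_by_epoch):
    """Score each example by the first epoch where it was predicted correctly.

    correct_by_epoch[e][i] is True if example i was correct at epoch e.
    """
    num_epochs = len(correct_by_epoch)
    num_examples = len(correct_by_epoch[0])
    # Assumption: examples that are never correct get the largest score.
    scores = [num_epochs] * num_examples
    # Walk backwards so the earliest correct epoch is the value that remains.
    for epoch in reversed(range(num_epochs)):
        for i, is_correct in enumerate(correct_by_epoch[epoch]):
            if is_correct:
                scores[i] = epoch
    return scores


# Hypothetical prediction history for 3 examples over 3 epochs.
history = [
    [True, False, False],
    [True, True, False],
    [False, True, False],
]
print(first_correct_epoch(history))  # [0, 1, 3]
```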

<p>There are infinitely many valid pacing functions, since any monotonic function of the training step qualifies. This paper experiments with 6 families of pacing functions: logarithmic, exponential, step, linear, quadratic, and root. Each family has two parameters: the fraction of training steps needed before the full dataset is used ($a$) and the fraction of the dataset used at the beginning of training ($b$). A value of $a$ above 1, such as 1.6, means that the full dataset is never reached during training. With 6 different values of $a$ (0.01, 0.1, 0.2, 0.4, 0.8, 1.6) and 5 different values of $b$ (0.0025, 0.1, 0.2, 0.4, 0.8), each family has 30 different combinations of parameters, resulting in a total of 180 pacing functions tested.</p>

<figure style="text-align: center;">
  <a href="/public/images/2021-12-01-when-do-curricula-work/pacing_functions.png">
    <img style="margin: 0 auto;" src="/public/images/2021-12-01-when-do-curricula-work/pacing_functions.png" alt="Different families of pacing functions" />
  </a>
  <figcaption>
  Plots of different pacing functions and their equations. Figure 4 from this paper.
  </figcaption>
</figure>
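<p>To make the roles of $a$ and $b$ concrete, here is a minimal sketch of a polynomial pacing function together with the 180-configuration grid described above. The exact formula is an assumption chosen to match the description (start at a fraction $b$ of the data, reach the full dataset after a fraction $a$ of the steps); the equations in the figure above are the authoritative ones. The function name and the example dataset size of 50000 are hypothetical.</p>

```python
import itertools


def polynomial_pacing(t, total_steps, dataset_size, a, b, power):
    """Assumed form: g(t) = N * (b + (1 - b) * (t / (a * T)) ** power), capped at N.

    power=1 gives a linear shape, power=2 quadratic, power=0.5 root.
    """
    progress = t / (a * total_steps)
    size = dataset_size * (b + (1.0 - b) * progress ** power)
    return int(min(dataset_size, max(1, size)))


FAMILIES = ["logarithmic", "exponential", "step", "linear", "quadratic", "root"]
A_VALUES = [0.01, 0.1, 0.2, 0.4, 0.8, 1.6]
B_VALUES = [0.0025, 0.1, 0.2, 0.4, 0.8]

# 6 families x 6 values of a x 5 values of b
grid = list(itertools.product(FAMILIES, A_VALUES, B_VALUES))
print(len(grid))  # 180

# Linear pacing with a=0.8, b=0.2 over 35200 steps on 50000 examples.
for step in (0, 14080, 28160, 35200):
    print(step, polynomial_pacing(step, 35200, 50000, a=0.8, b=0.2, power=1))
```

<p>With these settings the dataset starts at 20% of its size and reaches the full 50000 examples at step 28160, which is 80% of training; after that the cap keeps it constant.</p>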

<h2 id="standard-setting">Standard Setting</h2>

<p>To test ordered learning, a ResNet-50 model was trained on the CIFAR10 and CIFAR100 datasets for 100 epochs. Each combination of the 180 pacing functions and the 3 orders (curriculum, anti-curriculum, and random) was tested, and the best of 3 random seeds was used for each combination.</p>

<p>The paper defines 3 baselines to evaluate the runs. The <em>standard1</em> baseline is the mean performance of all 540 runs. The <em>standard2</em> baseline splits the 540 runs into 180 groups of 3 and takes the mean of the 180 group maximums, which represents a hyperparameter sweep. The <em>standard3</em> baseline is the mean of the top three values of the 540 runs.</p>
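<p>The three baselines can be computed from a flat list of run accuracies as sketched below. This follows the description in this post rather than the authors' code; the function name is hypothetical and the accuracies are randomly generated placeholders, not results from the paper.</p>

```python
import random
import statistics


def standard_baselines(accuracies, group_size=3):
    """Return (standard1, standard2, standard3) as described in the text above."""
    groups = [
        accuracies[i:i + group_size]
        for i in range(0, len(accuracies), group_size)
    ]
    # standard1: mean over every run.
    standard1 = statistics.mean(accuracies)
    # standard2: mean of the per-group maximums (180 groups of 3 for 540 runs).
    standard2 = statistics.mean([max(group) for group in groups])
    # standard3: mean of the three best runs overall.
    standard3 = statistics.mean(sorted(accuracies, reverse=True)[:3])
    return standard1, standard2, standard3


rng = random.Random(0)
runs = [rng.gauss(0.93, 0.005) for _ in range(540)]  # placeholder accuracies
s1, s2, s3 = standard_baselines(runs)
print(round(s1, 4), round(s2, 4), round(s3, 4))
```

<p>By construction the three values are ordered from lowest to highest, so <em>standard3</em> is the hardest baseline to beat.</p>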

<p>Experiments show that all three orderings perform similarly, which suggests that any benefit comes from the dynamic dataset size induced by the pacing function rather than from the ordering itself. However, even this benefit is marginal, as it does not significantly outperform the <em>standard2</em> baseline, which accounts for the large-scale hyperparameter sweep performed.</p>

<figure style="text-align: center;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-when-do-curricula-work/standard_setting.png" alt="Experiment results on standard setting" />
  <figcaption>
  Experiment results on the standard setting on CIFAR10 and CIFAR100. (a) shows bar plots for the best mean accuracy for each method with the 3 baselines. (b) shows accuracies of all 180 configurations averaged over 3 random seeds. The solid black line denotes the mean, dashed lines denote standard deviation, and the orange line denotes the <em>standard2</em> baseline. Figure 5 from this paper.
  </figcaption>
</figure>

<h2 id="time-limited-setting">Time-limited Setting</h2>

<p>For the time-limited setting, the same experiments are performed but with 1, 5, or 50 epochs (352, 1760, or 17600 steps) instead of 100 epochs (35200 steps). As the total number of steps decreases, curriculum learning shows larger performance gains. The pacing function also seems to help performance, as all three ordered learning methods perform at least comparably to the <em>standard3</em> baseline.</p>

<figure style="text-align: center;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-when-do-curricula-work/time_limited_setting_cifar10.png" alt="Experiment results on CIFAR10 for time-limited setting" />
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-when-do-curricula-work/time_limited_setting_cifar100.png" alt="Experiment results on CIFAR100 for time-limited setting" />
  <figcaption>
  Experiment results on the time-limited setting on CIFAR10 and CIFAR100. Figures 6 and 17 from this paper.
  </figcaption>
</figure>

<h2 id="noisy-label-setting">Noisy Label Setting</h2>

<p>To test ordered learning in the noisy setting, artificial label noise was added by randomly permuting labels. Experiments were done with the same setup but with 20%, 40%, 60%, and 80% label noise, and with recomputed c-scores. Again, curriculum learning clearly outperforms the other methods at all noise levels.</p>

<figure style="text-align: center;">
  <img style="margin: 0 auto;" src="/public/images/2021-12-01-when-do-curricula-work/noisy_setting.png" alt="Experiment results on CIFAR100 for noisy label setting" />
  <figcaption>
  Experiment results on the noisy label setting on CIFAR100. Figure 7 from this paper.
  </figcaption>
</figure>
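<p>One plausible way to inject this kind of label noise is sketched below: pick a fraction of the examples and shuffle their labels among themselves. This is an assumption about the procedure, not the authors' implementation, and the function name <code>permute_labels</code> is hypothetical.</p>

```python
import random


def permute_labels(labels, noise_fraction, seed=0):
    """Return a copy of `labels` where a random fraction has been permuted."""
    rng = random.Random(seed)
    noisy = list(labels)
    num_noisy = int(round(noise_fraction * len(noisy)))
    # Choose which examples receive a (possibly) wrong label.
    chosen = rng.sample(range(len(noisy)), num_noisy)
    # Shuffle the labels of the chosen examples among themselves.
    shuffled = [noisy[i] for i in chosen]
    rng.shuffle(shuffled)
    for index, label in zip(chosen, shuffled):
        noisy[index] = label
    return noisy


labels = [i % 10 for i in range(1000)]  # hypothetical 10-class labels
noisy_labels = permute_labels(labels, noise_fraction=0.4)
changed = sum(1 for old, new in zip(labels, noisy_labels) if old != new)
print(changed, "of", len(labels), "labels changed")
```

<p>Because the labels are permuted rather than resampled, the class histogram is unchanged, and some chosen examples keep their original label by chance, so the fraction of labels that actually change is somewhat below the nominal noise level.</p>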

<h2 id="conclusion">Conclusion</h2>

<p>Curriculum learning only helps performance if training time is limited or if the dataset contains noisy labels. This is consistent with current practice: ordered learning is not standard in supervised image classification but is used when training generalized language models.</p>

<p>Please read the paper if you want to learn more about:</p>
<ul>
  <li>Implicit curricula: Examples are learned in a consistent order given that the order in which examples are presented during training is fixed</li>
  <li>Correlations between different scoring functions and different pacing functions</li>
  <li>More analysis on the pacing functions and the c-scores in the noisy label setting</li>
  <li>More experiments on the FOOD101 and FOOD101N datasets</li>
</ul>

<p>Some other relevant papers that could be interesting to read are:</p>
<ul>
  <li><a href="https://proceedings.mlr.press/v139/jiang21k.html">Exploring the Memorization-Generalization Continuum in Deep Learning (Jiang et al., 2021)</a> defines the consistency score (C-score) used for the scoring function for the curriculum in this paper.</li>
  <li><a href="https://aclanthology.org/2021.sustainlp-1.15/">On the Role of Corpus Ordering in Language Modeling (Agrawal et al., 2021)</a> performs similar experiments with curriculum learning for pretraining language models. The authors conclude that curriculum learning can show “consistent improvement gains over conventional vanilla training.” This supports the conclusion above, as language models are often trained under a limited computational budget relative to the size of the dataset.</li>
</ul>

</div>

<div id="bibtex-container" class="related">
  For attribution in academic contexts, please cite this work as
  <pre id="bibtex-academic-attribution">

  </pre>

  BibTeX citation
  <pre id="bibtex-box">

  </pre>
</div>
<script>
  let authorsSpan = document.getElementById("iclr-post-authors");
  let authorsText = authorsSpan.textContent;
  let lnameFnameInstitution = authorsText.split(";");
  let lfiList = lnameFnameInstitution.map(lfi => lfi.split(",").map(item => item.trim()));
  let bibtexLFI = lfiList.map(lfi => lfi[0] + ", " + lfi[1]).join(" and ")
  let academicLFI = lfiList.map(lfi => lfi[0]);
  {
    if(academicLFI.length > 2) academicLFI = academicLFI[0] + ", et al.";
    else if(academicLFI.length == 2) academicLFI = academicLFI[0] + " & " + academicLFI[1];
    else academicLFI = academicLFI[0];
  }

  let titleSpan = document.getElementById("iclr-post-title");
  let titleText = titleSpan.textContent.trim();
  let bibtexTitleShorthand = (lfiList[0][1]+
    "2022"+
    titleText.split(" ").slice(0, 3).join("")
  ).replace(" ", "").replace(/[\p{P}$+<=>^`|~]/gu, '').toLowerCase().trim();

  let bibtexTemplate = `
@inproceedings{${bibtexTitleShorthand},
  author = {${bibtexLFI}},
  title = {${titleText}},
  booktitle = {ICLR Blog Track},
  year = {2022},
  note = {${window.location.href}},
  url  = {${window.location.href}}
}
  `.trim();
  document.getElementById("bibtex-box").innerText = bibtexTemplate;

  let academicTemplate = `
${academicLFI}, "${titleText}", ICLR Blog Track, 2022.
`.trim();
  document.getElementById("bibtex-academic-attribution").innerText = academicTemplate;

</script>


<div class="related">
  <h2>Related posts</h2>
  <ul class="related-posts">
    
      <li>
        <h3>
          <a href="/2021/09/01/sample-submission/">
            Sample Submission
            <small>01 Sep 2021 | 
    <a class="content-tag" href="/tags/#machine-learning"> machine learning </a>
  
    <a class="content-tag" href="/tags/#deep-learning"> deep learning </a>
  
    <a class="content-tag" href="/tags/#curriculum-learning"> curriculum learning </a>
  </small>
          </a>
        </h3>
      </li>
    
      <li>
        <h3>
          <a href="/2020/04/02/example-content/">
            Example content (Basic Markdown)
            <small>02 Apr 2020 | 
    <a class="content-tag" href="/tags/#machine-learning"> machine learning </a>
  
    <a class="content-tag" href="/tags/#deep-learning"> deep learning </a>
  
    <a class="content-tag" href="/tags/#curriculum-learning"> curriculum learning </a>
  </small>
          </a>
        </h3>
      </li>
    
  </ul>
</div>


<script src="https://utteranc.es/client.js"
        repo="iclr-blog-track/iclr-blog-track.github.io"
        issue-term="pathname"
        label="utterance"
        theme="boxy-light"
        crossorigin="anonymous"
        >
</script>


      </div>
    </div>

    <label for="sidebar-checkbox" class="sidebar-toggle"></label>

    <script src='/public/js/script.js'></script>
  </body>
</html>
