When Do Curricula Work? (Wu, Dyer, and Neyshabur, 2021)
01 Dec 2021 | machine learning, deep learning, curriculum learning

This post is a summary of When Do Curricula Work? (Wu, Dyer, and Neyshabur, 2021), a paper accepted to ICLR 2021 for an oral presentation.
Summary
By default, training data is presented to the neural network in random order. Curriculum learning and anti-curriculum learning instead propose ordering the examples by their difficulty: curriculum learning presents easier examples earlier, whereas anti-curriculum learning presents harder examples earlier. This paper performs an empirical study of these ordered learning techniques on image classification and concludes that:
- No curricula benefit final performance in the standard setting, but
- Curriculum learning can help if training time is limited or the dataset is noisy
This paper may be interesting to you if you:
- want to know if curriculum learning will benefit your model, or
- need to choose a scoring function and a pacing function to define your curricula.
Defining a Curriculum
Although the idea behind curriculum learning and anti-curriculum learning is simple, there are many choices that could result in a different curriculum. We can define a curriculum through 3 components:
- The scoring function $s(x)$, which scores the example $x$
- The pacing function $g(t)$, which determines the size of the dataset at step $t$
- The order
Before training, each example in the dataset is assigned a score by the scoring function. During training, the pacing function determines the size of the dataset at each step $t$. Depending on the order (“curriculum” or “anti-curriculum”), the dataset for step $t$ consists of the $g(t)$ lowest- or highest-scored examples. A “random” ordering is also allowed to serve as a baseline; note that a curriculum with random ordering is still paired with a pacing function, so its dataset size still varies over training.
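To make the three components concrete, here is a minimal sketch of how a batch could be drawn at step $t$ under such a curriculum. The function name, the with-replacement batch sampling, and the assumption that lower scores mean easier examples are illustrative choices, not the paper's implementation.

```python
import numpy as np

def batch_for_step(t, scores, pacing, order, batch_size=128, rng=None):
    """Sample one batch at training step t under an ordered-learning curriculum.

    scores: per-example difficulty scores (assumed here: lower = easier).
    pacing: the pacing function g, mapping step t to the current dataset size.
    order:  "curriculum" (easiest first), "anti-curriculum" (hardest first),
            or "random" (scores ignored, pacing still applied).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    n = pacing(t)                                # g(t): size of the working set
    if order == "curriculum":
        pool = np.argsort(scores)[:n]            # the n lowest-scored examples
    elif order == "anti-curriculum":
        pool = np.argsort(scores)[-n:]           # the n highest-scored examples
    else:
        pool = rng.permutation(len(scores))[:n]  # a random subset of size n
    return rng.choice(pool, size=batch_size, replace=True)
```

A training loop would call this once per step and index the dataset with the returned example indices.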
For the scoring function, the paper chooses the c-score by Jiang et al., 2021, which quantifies how well a model can predict the example’s label when trained on a dataset that excludes that example. Other ways to score an example are to use its loss or the index of the epoch at which the model first predicts it correctly. However, experiments show that these 3 scoring functions are highly correlated on both VGG-11 and ResNet-18, so only the c-score is used.
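As a point of comparison, the loss-based alternative mentioned above is cheap to compute. The sketch below scores every example with the cross-entropy loss of a pretrained reference model; the helper name and the assumption of a non-shuffled PyTorch loader are mine, not the paper's.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_based_scores(model, loader, device="cpu"):
    """Score each example by a reference model's cross-entropy loss
    (higher loss = harder). The loader must not shuffle, so that the
    returned scores line up with the dataset's example order."""
    model.eval().to(device)
    scores = []
    for x, y in loader:
        logits = model(x.to(device))
        loss = F.cross_entropy(logits, y.to(device), reduction="none")
        scores.append(loss.cpu())
    return torch.cat(scores)
```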
There are infinitely many valid pacing functions, as any non-decreasing function will do. This paper experiments with 6 families of pacing functions: logarithmic, exponential, step, linear, quadratic, and root. Each family has two important parameters: the fraction of training steps needed before the full dataset is used ($a$) and the fraction of the dataset used at the beginning of training ($b$). With 6 different values of $a$ (0.01, 0.1, 0.2, 0.4, 0.8, 1.6) and 5 different values of $b$ (0.0025, 0.1, 0.2, 0.4, 0.8), each family has 30 parameter combinations, resulting in a total of 180 pacing functions tested.
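For instance, the linear member of these families could be parameterized as below; the exact functional form is my reading of the $(a, b)$ description above rather than the paper's code. The other families replace the linear growth between $b$ and 1 with logarithmic, exponential, step, quadratic, or root-shaped growth.

```python
def linear_pacing(total_steps, dataset_size, a=0.2, b=0.1):
    """Return a linear pacing function g(t): start from a fraction b of the
    dataset and grow linearly so that the full dataset is reached after a
    fraction a of the training steps (never reached if a > 1, e.g. a = 1.6)."""
    def g(t):
        frac = min(1.0, b + (1.0 - b) * t / (a * total_steps))
        return max(1, int(frac * dataset_size))
    return g
```

The returned closure can be passed directly as the `pacing` argument of the earlier `batch_for_step` sketch.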

Standard Setting
To test ordered learning, a ResNet-50 model was trained on the CIFAR10 and CIFAR100 datasets for 100 epochs. Every combination of the 180 pacing functions and the 3 orderings (curriculum, anti-curriculum, and random) was tested, and the best of 3 random seeds was used for each combination.
The paper defines 3 baselines to evaluate the runs. The standard1 baseline is the mean performance of all 540 runs. The standard2 baseline is the mean of the 180 maxima taken over 180 groups of 3 runs and represents a best-of-3-seeds hyperparameter sweep. The standard3 baseline is the mean of the top three values of the 540 runs.
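To make the aggregation explicit, here is a sketch of how the three baselines could be computed from an array of 540 run accuracies (my own helper, not the paper's code):

```python
import numpy as np

def standard_baselines(accs, group_size=3):
    """Aggregate 540 run accuracies into the three baselines described above."""
    accs = np.asarray(accs)                   # shape (540,)
    standard1 = accs.mean()                   # mean of all runs
    groups = accs.reshape(-1, group_size)     # 180 groups of 3 seeds each
    standard2 = groups.max(axis=1).mean()     # mean of per-group maxima
    standard3 = np.sort(accs)[-3:].mean()     # mean of the top-3 runs
    return standard1, standard2, standard3
```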
Experiments show that all three orderings perform similarly, which suggests that any benefit comes from the dynamic dataset size induced by the pacing function rather than from the ordering itself. Even this benefit is marginal, as it does not significantly outperform the standard2 baseline, which accounts for the large-scale hyperparameter sweep performed.

Time-limited Setting
For the time-limited setting, the same experiments are performed but with 1, 5, or 50 epochs (352, 1760, and 17600 steps) instead of 100 epochs (35200 steps). As the total number of steps decreases, curriculum learning shows larger performance gains. The pacing function also seems to help on its own, as all three orderings show at least comparable performance to the standard3 baseline.


Noisy Label Setting
To test ordered learning in the noisy setting, artificial label noise was added by randomly permuting labels. Experiments were run with the same setup but with 20%, 40%, 60%, and 80% label noise, and with recomputed c-scores. Again, curriculum learning clearly outperforms the other orderings at all noise levels.
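A sketch of how such noise could be injected is below; the helper and the choice to permute labels only within the corrupted subset are my reading of the setup, not the paper's code.

```python
import numpy as np

def add_label_noise(labels, noise_frac, seed=0):
    """Randomly permute the labels of a noise_frac fraction of the examples
    (e.g. noise_frac = 0.2, 0.4, 0.6, or 0.8), leaving the rest untouched."""
    rng = np.random.default_rng(seed)
    labels = np.array(labels)
    idx = rng.choice(len(labels), size=int(noise_frac * len(labels)), replace=False)
    labels[idx] = labels[rng.permutation(idx)]  # shuffle labels within the subset
    return labels
```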

Conclusion
Curriculum learning only helps performance if training time is limited or if the dataset contains noisy labels. This reflects current practice, where ordered learning is not standard in supervised image classification but is used when training large-scale language models.
Please read the paper if you want to learn more about:
- Implicit curricula: Examples are learned in a consistent order given that the order in which examples are presented during training is fixed
- Correlations between different scoring functions and different pacing functions
- More analysis on the pacing functions and the c-scores in the noisy label setting
- More experiments on the FOOD101 and FOOD101N datasets
Some other relevant papers that could be interesting to read are:
- Exploring the Memorization-Generalization Continuum in Deep Learning (Jiang et al., 2021) defines the consistency score (C-score) used as the scoring function for the curricula in this paper.
- On the Role of Corpus Ordering in Language Modeling (Agrawal et al., 2021) performs similar experiments with curriculum learning for pretraining language models. The authors conclude that curriculum learning can show “consistent improvement gains over conventional vanilla training.” This supports this post’s conclusion, as language models are often trained under a computational budget that is limited relative to the size of the dataset.