When Do Curricula Work? (Wu, Dyer, and Neyshabur, 2021)

This post is a summary of When Do Curricula Work? (Wu, Dyer, and Neyshabur, 2021), a paper accepted to ICLR 2021 for an oral presentation.

Summary

By default, data is presented to a neural network in random order. Curriculum learning and anti-curriculum learning instead propose ordering the examples by their difficulty: curriculum learning presents easier examples earlier, whereas anti-curriculum learning presents harder examples earlier. This paper performs an empirical study of these ordered learning techniques on image classification tasks and concludes that:

  • No curricula benefit final performance in the standard setting, but
  • Curriculum learning can help if training time is limited or the dataset is noisy

This paper may be interesting to you if you:

  • want to know if curriculum learning will benefit your model, or
  • need to choose a scoring function and a pacing function to define your curricula.

Curriculum Learning vs Anti-curriculum Learning

Defining a Curriculum

Although the idea behind curriculum learning and anti-curriculum learning is simple, there are many choices that could result in a different curriculum. We can define a curriculum through 3 components:

  • The scoring function $s(x)$, which scores the example $x$
  • The pacing function $g(t)$, which determines the size of the dataset at step $t$
  • The order (curriculum, anti-curriculum, or random)

Before training, each example in the dataset is scored by the scoring function. During training, at each step $t$, the pacing function determines the size of the dataset: depending on the order (“curriculum” or “anti-curriculum”), the dataset for step $t$ consists of the $g(t)$ lowest- or highest-scored examples. A “random” ordering is also allowed to serve as a baseline; note that a curriculum with random ordering is still paired with a pacing function, so its dataset size still varies over the training phase.
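As a rough illustration of how these three components interact, here is a minimal sketch in Python; the function names (`pacing_fn`, `train_step`) and the sampling details are my own assumptions, not the paper's implementation.

```python
import numpy as np

def train_with_curriculum(examples, scores, pacing_fn, order,
                          total_steps, batch_size, train_step):
    """Train with an ordered-learning curriculum: sort examples by score,
    then at each step draw a mini-batch from the g(t) allowed examples."""
    # Sort indices by score: "curriculum" keeps the lowest-scored (easiest)
    # examples first, "anti-curriculum" the highest-scored (hardest).
    idx = np.argsort(scores)
    if order == "anti-curriculum":
        idx = idx[::-1]
    elif order == "random":
        np.random.shuffle(idx)  # random order, but still paced

    for t in range(total_steps):
        # Pacing function g(t): how many examples are available at step t.
        n_available = int(pacing_fn(t))
        pool = idx[:n_available]
        # Mini-batches are sampled uniformly from the current pool.
        batch = np.random.choice(pool, size=min(batch_size, len(pool)),
                                 replace=False)
        train_step([examples[i] for i in batch])
```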

For the scoring function, the paper chooses the c-score by Jiang et al., 2021, which quantifies how well a model can predict an example’s label when trained on the dataset without that example. Other ways to score an example are to use its loss or the index of the epoch at which the model first predicts it correctly. However, experiments show that these 3 scoring functions are highly correlated on both VGG-11 and ResNet-18, so only the c-score is used.
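For instance, a loss-based score (one of the cheaper alternatives mentioned above) could be computed with an already-trained reference model; the snippet below is a hypothetical PyTorch sketch of that idea, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_based_scores(model, loader, device="cpu"):
    """Score each example by the cross-entropy loss of a trained reference
    model: higher loss is treated as a harder example.
    Assumes `loader` iterates the dataset in a fixed, unshuffled order."""
    model.eval()
    scores = []
    for images, labels in loader:
        logits = model(images.to(device))
        loss = F.cross_entropy(logits, labels.to(device), reduction="none")
        scores.append(loss.cpu())
    return torch.cat(scores)
```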

There are infinitely many valid pacing functions, as all we need is a monotonic function. This paper experiments with 6 families of pacing functions: logarithmic, exponential, step, linear, quadratic, and root. There are also two important parameters: the fraction of training steps needed before using the full dataset ($a$) and the fraction of the dataset used at the beginning of training ($b$). With 6 different values of $a$ (0.01, 0.1, 0.2, 0.4, 0.8, 1.6) and 5 different values of $b$ (0.0025, 0.1, 0.2, 0.4, 0.8), each family has 30 different combinations of parameters, resulting in a total of 180 pacing functions tested.

Different families of pacing functions
Plots of different pacing functions and their equations. Figure 4 from this paper.
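As a concrete illustration of the $a$ and $b$ parameters, here is a sketch of two of these families (linear and exponential) in Python; the exact functional forms in the paper's Figure 4 may differ slightly, so treat these as assumptions.

```python
import numpy as np

def linear_pacing(t, total_steps, dataset_size, a=0.2, b=0.1):
    """Dataset size grows linearly from b*N at step 0 to the full N
    once a*total_steps steps have passed."""
    t_full = a * total_steps
    frac = b + (1.0 - b) * min(t / t_full, 1.0)
    return int(np.ceil(frac * dataset_size))

def exponential_pacing(t, total_steps, dataset_size, a=0.2, b=0.1):
    """Dataset size grows exponentially from b*N to the full N
    over a*total_steps steps."""
    t_full = a * total_steps
    frac = b * np.exp(min(t / t_full, 1.0) * np.log(1.0 / b))
    return int(np.ceil(frac * dataset_size))
```

For example, with the default parameters above, `linear_pacing(0, 35200, 50000)` returns 5000 examples (10% of the CIFAR10 training set) and reaches the full 50000 examples by step 7040.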

Standard Setting

To test ordered learning, a ResNet-50 model was trained on the CIFAR10 and CIFAR100 datasets for 100 epochs. Every combination of the 180 pacing functions and the 3 orderings (curriculum, anti-curriculum, and random) was tested, and the best of 3 random seeds was used for each combination.

The paper defines 3 baselines to evaluate the runs. The standard1 baseline is the mean performance over all 540 runs. The standard2 baseline is the mean of the 180 maxima over 180 groups of 3 runs and represents a hyperparameter sweep. The standard3 baseline is the mean of the top three values out of all 540 runs.
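Concretely, given an array of the 540 standard-run accuracies, the three baselines could be computed as in the sketch below (the array contents here are placeholders).

```python
import numpy as np

accuracies = np.random.rand(540)  # placeholder for the 540 standard runs

# standard1: mean over all 540 runs.
standard1 = accuracies.mean()

# standard2: average of the 180 maxima taken over groups of 3 runs,
# mirroring the best-of-3-seeds selection used for each curriculum configuration.
standard2 = accuracies.reshape(180, 3).max(axis=1).mean()

# standard3: mean of the top three runs out of all 540.
standard3 = np.sort(accuracies)[-3:].mean()
```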

Experiments show that all three orderings achieve similar performance, which suggests that any benefit comes from the dynamic dataset size induced by the pacing function rather than from the ordering itself. However, even this benefit is marginal, as it does not significantly outperform the standard2 baseline, which accounts for the large-scale hyperparameter sweep performed.

Experiment results on standard setting
Experiment results on the standard setting on CIFAR10 and CIFAR100. (a) shows bar plots for the best mean accuracy for each method with the 3 baselines. (b) shows accuracies of all 180 configurations averaged over 3 random seeds. The solid black line denotes the mean, dashed lines denote standard deviation, and the orange line denotes the standard2 baseline. Figure 5 from this paper.

Time-limited Setting

For the time-limited setting, the same experiments are performed with 1, 5, or 50 epochs (352, 1760, or 17600 steps) instead of 100 epochs (35200 steps). As the total number of steps decreases, curriculum learning shows larger performance gains. The pacing function also seems to help performance, as all three ordered learning methods show at least comparable performance to the standard3 baseline.

Experiment results on CIFAR10 for time-limited setting Experiment results on CIFAR100 for time-limited setting
Experiment results on the time-limited setting on CIFAR10 and CIFAR100. Figures 6 and 17 from this paper.

Noisy Label Setting

To test ordered learning in the noisy setting, artificial label noise was added by randomly permuting labels. Experiments were run with the same setup but with 20%, 40%, 60%, and 80% label noise, and with recomputed c-scores. Again, curriculum learning clearly outperforms the other methods at all noise levels.
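One common way to inject this kind of noise is sketched below; the exact procedure in the paper may differ, so the helper name and details are assumptions.

```python
import numpy as np

def add_label_noise(labels, noise_fraction, seed=0):
    """Randomly permute the labels of a chosen fraction of the examples."""
    rng = np.random.default_rng(seed)
    noisy_labels = labels.copy()
    n_noisy = int(noise_fraction * len(labels))
    noisy_idx = rng.choice(len(labels), size=n_noisy, replace=False)
    # Permuting labels within the selected subset corrupts (approximately)
    # that fraction of the dataset.
    noisy_labels[noisy_idx] = rng.permutation(noisy_labels[noisy_idx])
    return noisy_labels
```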

Experiment results on CIFAR100 for noisy label setting
Experiment results on the noisy label setting on CIFAR100. Figure 7 from this paper.

Conclusion

Curriculum learning only helps performance if training time is limited or if the dataset contains noisy labels. This matches common practice: ordered learning is not standard in supervised image classification, but it is used when training large-scale language models.

Please read the paper if you want to learn more about:

  • Implicit curricula: networks learn examples in a consistent order when the presentation order during training is fixed
  • Correlations between different scoring functions and different pacing functions
  • More analysis on the pacing functions and the c-scores in the noisy label setting
  • More experiments on the FOOD101 and FOOD101N datasets

Some other relevant papers that could be interesting to read are: