Keywords: robust optimization, distillation, distributional robustness, long-tail learning
TL;DR: We explore the interplay between robust optimization objectives for the teacher and student in a knowledge distillation setting.
Abstract: Knowledge distillation is a popular technique that has been shown to produce remarkable gains in average accuracy. However, recent work has shown that these gains are not uniform across subgroups in the data, and can often come at the cost of accuracy on rare subgroups and classes. Robust optimization is a common remedy to improve worst-class accuracy in standard learning settings, but in distillation it is unknown whether it is best to apply robust objectives when training the teacher, the student, or both. This work studies the interplay between robust objectives for the teacher and student. Empirically, we show that that jointly modifying the teacher and student objectives can lead to better worst-class student performance and even Pareto improvement in the tradeoff between worst-class and overall performance. Theoretically, we show that the *per-class calibration* of teacher scores is key when training a robust student. Both the theory and experiments support the surprising finding that applying a robust teacher training objective does not always yield a more robust student.
Supplementary Material: pdf
Other Supplementary Material: zip