Keywords: knowledge distillation, sparse parity, optimization, interpretability
Abstract: Knowledge distillation, where a student model learns from a teacher model, is a widely adopted approach to improve the training of small models. A known challenge is that a large teacher-student performance gap can hurt the effectiveness of distillation, which prior works have aimed to mitigate by providing intermediate supervision.
In this work, we study a popular approach called _progressive distillation_, where several intermediate checkpoints of the teacher are used successively to supervise the student as it learns.
Using sparse parity as a testbed, we show empirically and theoretically that these intermediate checkpoints constitute an implicit curriculum that accelerates student learning.
This curriculum provides explicit supervision for learning the features underlying the task; importantly, a fully trained teacher does not provide such supervision.
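As a rough, non-authoritative illustration of the setup described in the abstract, the sketch below trains a teacher MLP on a small sparse-parity task (d = 50 input bits, parity over the first k = 4), saves intermediate teacher checkpoints, and then distills a smaller student from each checkpoint's soft predictions in turn. The model widths, number of checkpoints, and optimization hyperparameters are placeholder assumptions for exposition, not the paper's configuration.

```python
# Minimal sketch of progressive distillation on sparse parity (assumed setup).
import copy
import torch
import torch.nn as nn

d, k, n = 50, 4, 20_000                          # input dim, parity size, samples (assumed)
X = torch.randint(0, 2, (n, d)).float() * 2 - 1  # inputs in {-1, +1}
y = (X[:, :k].prod(dim=1) > 0).float()           # label = parity of the first k coordinates

def mlp(width: int) -> nn.Sequential:
    """Two-layer ReLU network; widths are illustrative."""
    return nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))

def train(model, targets, steps=500, lr=1e-2):
    """Fit `model` to hard labels or soft teacher probabilities with logistic loss."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X).squeeze(-1), targets).backward()
        opt.step()

# Train a large teacher and keep intermediate checkpoints (the implicit curriculum).
teacher = mlp(width=1024)
checkpoints = []
for _ in range(4):                               # 4 teacher phases, assumed
    train(teacher, y)
    checkpoints.append(copy.deepcopy(teacher.state_dict()))

# Progressive distillation: the small student matches each intermediate
# teacher's soft predictions in turn, rather than only the final teacher's.
student = mlp(width=64)
frozen_teacher = mlp(width=1024)
for ckpt in checkpoints:
    frozen_teacher.load_state_dict(ckpt)
    with torch.no_grad():
        soft_targets = torch.sigmoid(frozen_teacher(X).squeeze(-1))
    train(student, soft_targets)
```

One-shot distillation would correspond to supervising the student only with the final entry of `checkpoints`; the claim studied in the paper is that the earlier checkpoints provide supervision on the underlying features that the final teacher alone does not.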
Submission Number: 43