- TL;DR: We present a novel Knowledge Distillation framework that uses peer samples as the teacher.
- Abstract: Knowledge Distillation (KD) is a widely used technique in deep learning for obtaining small, simple models whose performance is on par with that of their large, complex counterparts. Standard KD tends to be time-consuming because a teacher model must first be trained before it can guide the student model. Training the teacher on the fly could shorten this time, but it is not trivial to obtain a high-capacity teacher that gives quality guidance to student models this way. To address this, we present a novel KD framework that exploits dark knowledge from the whole training set. Within this framework, we propose a simple and effective implementation, named Distillation by Utilizing Peer Samples (DUPS), that operates in one generation. We verify our algorithm through extensive experiments. Compared with standard training on modern architectures, DUPS achieves an average improvement of 1%-2% on various tasks at nearly zero extra cost. DUPS also achieves comparable or even better performance than some typical KD methods that are much more time-consuming.
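
For readers unfamiliar with the standard KD setup that the abstract contrasts against, the following is a minimal sketch of the usual soft-target distillation loss, in which a student is trained on a mix of hard-label cross-entropy and a temperature-softened KL term against teacher outputs. This is not the DUPS algorithm: the abstract does not specify how peer samples produce the teacher signal, so `teacher_logits` here simply stands in for whatever provides the guidance. The function names, `temperature`, and `alpha` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Standard soft-target KD loss for one sample (assumed form, not DUPS).

    (1 - alpha) * CE(student, label) + alpha * T^2 * KL(teacher_T || student_T)
    """
    # Hard-label cross-entropy at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # KL divergence between temperature-softened teacher and student outputs.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s) if t > 0)

    # T^2 keeps the soft-term gradient scale comparable across temperatures.
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl
```

In standard KD, `teacher_logits` come from a separately trained large model, which is the source of the extra training time the abstract describes. With `alpha=1.0` and identical student and teacher logits, the loss is zero up to floating-point error, since the KL term vanishes.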