One Generation Knowledge Distillation by Utilizing Peer SamplesDownload PDF

Xingjian Li, Haozhe An, Haoyi Xiong, Jun Huan, Dejing Dou, Chengzhong Xu

25 Sep 2019 (modified: 24 Dec 2019)ICLR 2020 Conference Withdrawn SubmissionReaders: Everyone
  • Original Pdf: pdf
  • TL;DR: We present a novel framework of Knowledge Distillation utilizing peer samples as the teacher
  • Abstract: Knowledge Distillation (KD) is a widely used technique in recent deep learning research to obtain small and simple models whose performance is on a par with their large and complex counterparts. Standard Knowledge Distillation tends to be time-consuming because of the training time spent to obtain a teacher model that would then provide guidance for the student model. It might be possible to cut short the time by training a teacher model on the fly, but it is not trivial to have such a high-capacity teacher that gives quality guidance to student models this way. To improve this, we present a novel framework of Knowledge Distillation exploiting dark knowledge from the whole training set. In this framework, we propose a simple and effective implementation named Distillation by Utilizing Peer Samples (DUPS) in one generation. We verify our algorithm on numerous experiments. Compared with standard training on modern architectures, DUPS achieves an average improvement of 1%-2% on various tasks with nearly zero extra cost. Considering some typical Knowledge Distillation methods which are much more time-consuming, we also get comparable or even better performance using DUPS.
7 Replies