KD-Zero: Evolving Knowledge Distiller for Any Teacher-Student Pairs

Published: 21 Sept 2023, Last Modified: 14 Jan 2024 (NeurIPS 2023 poster)
Keywords: Knowledge distillation
TL;DR: We present KD-Zero, the first auto-search framework that evolves the best distiller from scratch to alleviate teacher-student gaps.
Abstract: Knowledge distillation (KD) has emerged as an effective model-compression technique that enhances lightweight models. Conventional KD methods propose various designs to help the student model imitate the teacher better. However, these handcrafted KD designs rely heavily on expert knowledge and may be sub-optimal for different teacher-student pairs. In this paper, we present a novel framework, KD-Zero, which uses evolutionary search to automatically discover promising distillers from scratch for any teacher-student architecture pair. Specifically, we first decompose the generalized distiller into knowledge transformations, distance functions, and loss weights. We then construct our distiller search space by selecting advanced operations for these three components. With sharpness and representation gap as fitness objectives, we evolve candidate populations and generate better distillers through crossover and mutation. To keep the search efficient, we employ a loss-rejection protocol, search-space shrinkage, and proxy settings during the search process. In this manner, the discovered distiller can address the capacity-gap and cross-architecture challenges for any teacher-student pair in the final distillation stage. Comprehensive experiments reveal that KD-Zero consistently outperforms other state-of-the-art methods across diverse architectures on classification, detection, and segmentation tasks. Notably, we provide practical insights for designing distillers by analyzing the discovered distillers. Code is available in the supplementary materials.
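To make the decomposition concrete, below is a minimal, hedged sketch in PyTorch of a distiller built from the three components named in the abstract (knowledge transformation, distance function, loss weight) together with toy crossover and mutation steps. The specific operation pools, the temperature value, and the single-generation loop are illustrative assumptions, not the actual KD-Zero search space or its sharpness / representation-gap fitness evaluation.

```python
import random
import torch
import torch.nn.functional as F

# Candidate knowledge transformations applied to logits (illustrative pool).
TRANSFORMS = {
    "identity": lambda z: z,
    "softmax":  lambda z: F.softmax(z / 4.0, dim=-1),   # temperature 4 is an assumption
    "l2_norm":  lambda z: F.normalize(z, dim=-1),
}

# Candidate distance functions between transformed teacher/student outputs.
DISTANCES = {
    "mse": lambda s, t: F.mse_loss(s, t),
    "l1":  lambda s, t: F.l1_loss(s, t),
    "cos": lambda s, t: (1.0 - F.cosine_similarity(s, t, dim=-1)).mean(),
}

# Candidate loss weights (constants here for simplicity).
WEIGHTS = {"w0.5": 0.5, "w1.0": 1.0, "w2.0": 2.0}


def make_distiller(genome):
    """Assemble a distillation loss from a (transform, distance, weight) genome."""
    tf, dist, w = TRANSFORMS[genome[0]], DISTANCES[genome[1]], WEIGHTS[genome[2]]
    return lambda student_logits, teacher_logits: w * dist(tf(student_logits),
                                                           tf(teacher_logits))


def mutate(genome):
    """Resample one of the three components at random (mutation)."""
    pools = [list(TRANSFORMS), list(DISTANCES), list(WEIGHTS)]
    i = random.randrange(3)
    g = list(genome)
    g[i] = random.choice(pools[i])
    return tuple(g)


def crossover(a, b):
    """Mix components from two parent genomes (crossover)."""
    return tuple(random.choice(pair) for pair in zip(a, b))


if __name__ == "__main__":
    random.seed(0)
    torch.manual_seed(0)
    s_logits, t_logits = torch.randn(8, 10), torch.randn(8, 10)

    # One toy generation: the real framework ranks children by fitness
    # (sharpness and representation gap) on proxy settings instead.
    population = [("softmax", "mse", "w1.0"), ("l2_norm", "cos", "w0.5")]
    child = mutate(crossover(*population))
    loss = make_distiller(child)(s_logits, t_logits)
    print(child, float(loss))
```

In this sketch a distiller is fully described by a three-symbol genome, so crossover and mutation stay cheap while the assembled loss can be evaluated directly on teacher and student outputs.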
Supplementary Material: pdf
Submission Number: 8