Abstract: Knowledge distillation (KD), which trains a smaller student model to approximate the predictions of a larger teacher model, is useful for striking a balance between model accuracy and computational constraints. However, KD has been found to be ineffective when there is a significant capacity gap between the teacher and student models. In this work, we address this issue via "meta-collaborative distillation" (MC-Distil), in which students of varying capacities collaborate during distillation. Using a "coordinator" network (C-Net), MC-Distil frames mutual learning among students as a meta-learning task. Our insight is that C-Net learns from each student's performance and from the characteristics of each training instance, allowing students of different capacities to improve together. Our method improves accuracy for all students, surpassing state-of-the-art baselines, including multi-step distillation, consensus enforcement, and teacher re-training. We achieve average gains of 2.5% on CIFAR100 and 2% on TinyImageNet, consistently across diverse student sizes, teacher sizes, and architectures. Notably, the observation that larger students also benefit from meta-collaboration with smaller students is novel. Finally, MC-Distil excels at training superior student models under real-world conditions such as label noise and domain adaptation.
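For context on the distillation setup the abstract builds on, the sketch below shows the standard KD objective (Hinton-style): a KL divergence between temperature-softened teacher and student predictions combined with the usual cross-entropy loss. This is generic background only, not the MC-Distil method itself; the temperature `T` and weight `alpha` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard knowledge-distillation loss: weighted sum of a softened
    KL term (student mimics teacher) and cross-entropy on true labels.
    T and alpha are placeholder hyperparameters for illustration."""
    # Temperature-soften both distributions.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # Scale the KL term by T^2 to keep gradient magnitudes comparable.
    distill = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * ce
```

MC-Distil's contribution, per the abstract, is to go beyond this single teacher-student objective by letting students of different capacities learn from one another through a coordinator network trained as a meta-learner.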
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Hongsheng_Li3
Submission Number: 4706