Self-Supervision Improves Multi-Teacher Distillation

ICLR 2026 Conference Submission 13604 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: computer vision, self-supervised learning, multi-teacher knowledge distillation
Abstract: Knowledge Distillation (KD) has evolved from compressing large models to enhancing the performance of models with the same capacity. Multi-teacher knowledge distillation extends this paradigm by amalgamating knowledge from multiple expert teachers into a single, more powerful student. However, existing frameworks constrain the student to learn exclusively from the teachers' representations, overlooking valuable supervisory signals inherent in the data itself. In this work, we introduce Self-supervised Feature Aggregation (SeFA), a novel paradigm that addresses this limitation by synergistically combining multi-teacher distillation with self-supervised learning. SeFA formulates training as a multi-task learning problem, optimizing the student's representations both for alignment with its teachers and for performance on a data-driven, self-supervised task. We conduct extensive evaluations across a diverse set of tasks, including image classification, transfer learning, domain adaptation, image retrieval, and dense prediction. SeFA consistently outperforms state-of-the-art baselines, achieving average improvements of 6.11% on classification, 8.87% on image retrieval, and 6.44% on dense prediction tasks. Beyond these empirical gains, our comprehensive analysis demonstrates SeFA's robustness across various teacher combinations and architectures, establishing a more effective paradigm for multi-teacher knowledge distillation.
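
The abstract describes SeFA's training objective only at a high level: a multi-task loss that jointly optimizes alignment with each teacher's features and a data-driven self-supervised objective. The following Python (PyTorch) sketch illustrates one way such an objective could be wired together; the MSE feature-alignment term, the SimCLR-style contrastive term, the per-teacher projection heads, and the weighting factor lambda_ssl are all illustrative assumptions rather than the paper's actual implementation.

# Minimal sketch of a multi-task objective combining multi-teacher feature
# alignment with a self-supervised loss. All names here (sefa_style_loss,
# lambda_ssl, the projection heads, the NT-Xent term) are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F
from torch import nn


class StudentWithHeads(nn.Module):
    """Student backbone plus one projection head per teacher and one SSL head."""

    def __init__(self, backbone, feat_dim, teacher_dims, ssl_dim=128):
        super().__init__()
        self.backbone = backbone
        # One linear projector per teacher to match that teacher's feature size.
        self.teacher_heads = nn.ModuleList(
            nn.Linear(feat_dim, d) for d in teacher_dims
        )
        # Separate head for the self-supervised (contrastive) task.
        self.ssl_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, ssl_dim)
        )

    def forward(self, x):
        feats = self.backbone(x)
        return feats, [h(feats) for h in self.teacher_heads], self.ssl_head(feats)


def nt_xent(z1, z2, temperature=0.2):
    """SimCLR-style contrastive loss between two augmented views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2B, D)
    sim = z @ z.t() / temperature                  # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))              # mask self-similarity
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)


def sefa_style_loss(student, teachers, view1, view2, lambda_ssl=1.0):
    """Multi-task loss: align with every teacher + self-supervised term."""
    _, proj1, ssl1 = student(view1)
    _, _, ssl2 = student(view2)

    # Distillation term: match each frozen teacher's features on view1.
    with torch.no_grad():
        teacher_feats = [t(view1) for t in teachers]
    distill = sum(F.mse_loss(p, t_feat) for p, t_feat in zip(proj1, teacher_feats))
    distill = distill / len(teachers)

    # Self-supervised term on two augmented views of the same images.
    ssl = nt_xent(ssl1, ssl2)
    return distill + lambda_ssl * ssl

In this sketch the teachers stay frozen and only the student backbone and its heads receive gradients; swapping the contrastive term for any other self-supervised objective leaves the overall multi-task structure unchanged.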
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 13604