Rethinking Knowledge Distillation: A Mixture-of-Experts Perspective

ICLR 2025 Conference Submission 4895 Authors

25 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Knowledge Distillation, Mixture-of-Experts
Abstract: Knowledge distillation (KD) aims to transfer useful information from a large-scale model (teacher) to a lightweight model (student). Classical KD leverages the teacher's predictions as soft labels to regularize student training. However, exactly matching predictions under the Kullback-Leibler (KL) divergence can conflict with the classification objective, since the discrepancy between teacher-generated predictions and ground-truth annotations is often severe. In this paper, we rethink the role of teacher predictions from a Mixture-of-Experts (MoE) perspective and transfer knowledge by introducing teacher predictions as latent variables to reformulate the classification objective. This MoE strategy breaks the vanilla classification task down into a mixture of easier subtasks, with the teacher classifier acting as a gating function that weighs the importance of each subtask. Each subtask is handled by a distinct expert, implemented efficiently from multi-level teacher outputs. We further develop a theoretical framework that formulates our method, termed MoE-KD, as an Expectation-Maximization (EM) algorithm and prove its convergence. Extensive experiments demonstrate that MoE-KD outperforms advanced knowledge distillers on mainstream benchmarks.
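
To make the contrast concrete, below is a minimal sketch of (i) the classical KD objective, which matches temperature-softened teacher and student distributions under a KL divergence, and (ii) a mixture-style objective in which the teacher's predictive distribution acts as a gating function over per-expert subtask likelihoods, in the spirit of the abstract. The sketch is illustrative only: the function names, the temperature, and the expert parameterization (generic per-expert log-probabilities rather than the paper's multi-level teacher outputs) are assumptions, not the authors' MoE-KD implementation.

```python
# Illustrative sketch only; names (kd_loss, moe_gated_nll, T) and the expert
# parameterization are assumptions, not the authors' MoE-KD implementation.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Classical KD: KL divergence between temperature-softened distributions."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def moe_gated_nll(expert_log_probs, teacher_logits, targets):
    """Mixture-style objective: the teacher classifier gates K subtask experts.

    expert_log_probs: (B, K, C) log p(y | x, z=k) from K hypothetical experts.
    teacher_logits:   (B, K)    teacher scores interpreted as the gate p(z | x).
    targets:          (B,)      ground-truth class indices.
    """
    log_gate = F.log_softmax(teacher_logits, dim=-1)  # (B, K)
    # Marginal log-likelihood: log p(y|x) = log sum_k p(z=k|x) p(y|x, z=k).
    mix_log_prob = torch.logsumexp(
        log_gate.unsqueeze(-1) + expert_log_probs, dim=1
    )  # (B, C)
    return F.nll_loss(mix_log_prob, targets)
```

When the gate sums to one and each expert output is a normalized distribution, `mix_log_prob` is a valid log-probability, which is the standard setup that admits an EM-style treatment of the latent variable, as the abstract indicates for MoE-KD.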
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4895