Keywords: Knowledge Distillation, Mixture-of-Experts
Abstract: Knowledge distillation (KD) aims to transfer useful information from a large-scale model (teacher) to a lightweight model (student).
Classical KD focuses on leveraging the teacher's predictions as soft labels to regularize student training.
However, enforcing an exact match of predictions through the Kullback-Leibler (KL) divergence can conflict with the classification objective, since teacher-generated predictions often diverge substantially from the ground-truth annotations.
In this paper, we rethink the role of teacher predictions from a Mixture-of-Experts (MoE) perspective and transfer knowledge by introducing teacher predictions as latent variables to reformulate the classification objective.
This MoE strategy breaks the vanilla classification task down into a mixture of easier subtasks, with the teacher classifier serving as a gating function that weighs the importance of each subtask.
Each subtask is handled by a distinct expert, implemented efficiently from multi-level teacher outputs.
We further develop a theoretical framework that formulates our method, termed MoE-KD, as an Expectation-Maximization (EM) algorithm and provide a proof of convergence.
Extensive experiments demonstrate that MoE-KD outperforms advanced knowledge distillers on mainstream benchmarks.
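A minimal sketch of the reformulation described above, written in standard mixture-of-experts notation; the gating weights \pi_k, expert distributions p_k, and responsibilities q(k) are generic placeholders rather than the paper's exact definitions.

\[
p(y \mid x) \;=\; \sum_{k=1}^{K} \pi_k(x)\, p_k(y \mid x),
\qquad
\log p(y \mid x) \;\ge\; \sum_{k=1}^{K} q(k) \log \frac{\pi_k(x)\, p_k(y \mid x)}{q(k)} .
\]

In a standard EM treatment, the E-step sets q(k) \propto \pi_k(x)\, p_k(y \mid x) and the M-step maximizes the resulting lower bound with respect to the student parameters. In the abstract's terms, \pi_k corresponds to the teacher classifier acting as a gating function, and each p_k corresponds to an expert built from multi-level teacher outputs.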
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4895