Keywords: Knowledge Distillation, Mixture-of-Experts
Abstract: Knowledge distillation (KD) aims to transfer useful information from a large-scale model (teacher) to a lightweight model (student).
Classical KD focuses on leveraging the teacher's predictions as soft labels to regularize student training.
However, enforcing an exact match of predictions through the Kullback-Leibler (KL) divergence can conflict with the classification objective, since teacher-generated predictions often diverge substantially from the ground-truth annotations.
In this paper, we rethink the role of teacher predictions from a Mixture-of-Experts (MoE) perspective and transfer knowledge by introducing teacher predictions as latent variables to reformulate the classification objective.
This MoE strategy breaks the vanilla classification task down into a mixture of easier subtasks, with the teacher classifier serving as a gating function that weighs the importance of each subtask.
Each subtask is handled by a distinct expert, implemented efficiently from multi-level teacher outputs.
We further develop a theoretical framework that formulates our method, termed MoE-KD, as an Expectation-Maximization (EM) algorithm and provide a proof of convergence.
Extensive experiments demonstrate that MoE-KD outperforms advanced knowledge distillers on mainstream benchmarks.
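A minimal sketch of the reformulation described above, written in standard mixture-of-experts notation; the gating weights \pi_k, expert distributions p_k, and responsibilities q(k) are generic placeholders rather than the paper's exact definitions.

\[
p(y \mid x) \;=\; \sum_{k=1}^{K} \pi_k(x)\, p_k(y \mid x),
\qquad
\log p(y \mid x) \;\ge\; \sum_{k=1}^{K} q(k) \log \frac{\pi_k(x)\, p_k(y \mid x)}{q(k)} .
\]

In a standard EM treatment, the E-step sets q(k) \propto \pi_k(x)\, p_k(y \mid x) and the M-step maximizes the resulting lower bound with respect to the student parameters. In the abstract's terms, \pi_k corresponds to the teacher classifier acting as a gating function, and each p_k corresponds to an expert built from multi-level teacher outputs.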
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4895