RRD: Routing-and-Residual Distillation for Efficient MoE Recovery in Large Language Models

Published: 29 May 2026, Last Modified: 29 May 2026, Venue: HiLD at ICML 2026 Poster, License: CC BY 4.0
Keywords: Mixture-of-Experts, Knowledge Distillation, Routing distillation, Shared experts distillation
Abstract: Mixture-of-Experts (MoE) architectures improve inference efficiency by activating only a small subset of parameters for each token. Recent dense-to-MoE conversion methods transform pretrained dense large language models into sparse MoEs through expert initialization, but practical top-$K$ routing prevents the converted model from fully reproducing the original dense computation. We view this recovery gap as arising from two coupled challenges: selecting appropriate experts and recovering information missed by the selected experts. We propose Routing-and-Residual Distillation (RRD), a teacher-guided framework that distills routing targets from the original dense model and repurposes shared experts to recover the remaining representation gap. Experiments demonstrate that teacher-guided routing substantially improves sparse conversion and that combining routing and residual recovery yields more faithful dense-to-MoE transfer. Code: https://anonymous.4open.science/r/rrd_moe-6B0B
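The recovery gap described in the abstract can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the authors' RRD implementation: it assumes a ReLU feed-forward layer whose hidden neurons are partitioned into contiguous, equally sized experts, and it uses a hypothetical routing target that ranks experts by the norm of their output. All function names and sizes are illustrative.

```python
import math
import random

def dense_forward(x, w_in, w_out):
    """Dense ReLU feed-forward layer: y = sum_h relu(w_in[h] . x) * w_out[h]."""
    y = [0.0] * len(w_out[0])
    for row_in, row_out in zip(w_in, w_out):
        a = max(0.0, sum(w * xi for w, xi in zip(row_in, x)))
        for j, w in enumerate(row_out):
            y[j] += a * w
    return y

def expert_outputs(x, w_in, w_out, num_experts):
    """Assumed conversion: split hidden neurons into contiguous expert groups."""
    size = len(w_in) // num_experts  # assumes d_hidden is divisible by num_experts
    return [
        dense_forward(x, w_in[e * size:(e + 1) * size], w_out[e * size:(e + 1) * size])
        for e in range(num_experts)
    ]

def combine(outs, indices, dim):
    """Sum the outputs of the experts listed in `indices`."""
    total = [0.0] * dim
    for e in indices:
        total = [t + o for t, o in zip(total, outs[e])]
    return total

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def teacher_top_k(outs, k):
    """Hypothetical routing target: the k experts with the largest output norm."""
    ranked = sorted(range(len(outs)), key=lambda e: norm(outs[e]), reverse=True)
    return ranked[:k]

# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d_model, d_hidden, num_experts, k = 4, 16, 4, 2
w_in = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(d_hidden)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

dense = dense_forward(x, w_in, w_out)               # teacher (dense) output
outs = expert_outputs(x, w_in, w_out, num_experts)  # per-expert contributions
selected = teacher_top_k(outs, k)                   # routing target from the teacher
sparse = combine(outs, selected, d_model)           # top-K sparse output
residual = [d - s for d, s in zip(dense, sparse)]   # what the selected experts miss

print("selected experts:", selected)
print("residual norm:", norm(residual))
```

In this toy setting, summing all experts reproduces the dense output exactly, so the residual equals the combined output of the unselected experts. Under these assumptions, that residual is the quantity a shared expert would need to approximate.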
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 180