SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation

Zichong Li; Chen Liang; Zixuan Zhang; Ilgee Hong; Young Jin Kim; Weizhu Chen; Tuo Zhao

SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation

Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, Tuo Zhao

Published: 08 Jul 2025, Last Modified: 26 Aug 2025COLM 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large Language Model, Mixture of Experts, Structured Pruning, Knowledge Distillation

TL;DR: This paper introduces a novel multi-stage prune-and-distill framework that efficiently compresses Phi-MoE 3.5 into compact 7.6B and 3.8B parameter models that significantly outperform similarly-sized alternatives.

Abstract: The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, their substantial memory requirements make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we propose \textit{SlimMoE}, a multi-stage compression framework that transforms large MoE models into significantly smaller and more efficient variants without the cost of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation typical of one-shot pruning. Using SlimMoE, we compress Phi-3.5-MoE (41.9B total / 6.6B activated parameters) into two smaller models: Phi-mini-MoE (7.6B total / 2.4B activated) and Phi-tiny-MoE (3.8B total / 1.1B activated), using only 400B tokens -- less than 10\% of the original training data. These models can be fine-tuned on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them well suited for academic and resource-limited settings. Our experiments show that the compressed models outperform others of similar size and remain competitive with larger models. For example, Phi-mini-MoE matches or exceeds the performance of Phi-3-mini while using only two-thirds of the activated parameters and achieves comparable MMLU scores to LLaMA 3.1 8B with significantly lower latency. These results highlight that structured pruning combined with multi-stage distillation is an effective strategy for building high-quality, compact MoE models, enabling broader adoption of MoE architectures across diverse computational environments. We release our models at \url{https://huggingface.co/microsoft/Phi-mini-MoE-instruct} and \url{https://huggingface.co/microsoft/Phi-tiny-MoE-instruct}.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Submission Number: 1280

Loading