Keywords: Mixture-of-experts, pruning
Abstract: Ultra-large Mixture-of-Experts (MoE) language models, \eg DeepSeek-R1, are rapidly emerging as a dominant architecture due to their superior scalability and performance. However, the massive number of expert parameters introduces substantial redundancy, posing serious challenges for efficient deployment.
Existing pruning methods face two fundamental challenges when applied to such MoE architectures.
First, although reconstruction-loss–based methods select experts more comprehensively by evaluating candidate expert combinations, the combinatorial search space renders exhaustive evaluation infeasible.
Second, most approaches rely on a fixed calibration dataset to guide pruning, which often fails to preserve the model’s full capabilities.
To address these challenges, we introduce two key innovations in our pruning framework. First, we propose a \emph{Coarse-to-Fine Expert Selection} strategy that reduces the computational complexity of reconstruction-loss–based selection from exponential ($\mathcal{O}(\binom{2n}{n})$) to polynomial ($\mathcal{O}(n^{1.5})$) in the number of experts (a worked comparison of these two scales follows the abstract). This significantly accelerates the pruning process without sacrificing selection quality.
Second, we develop a \emph{Dynamic Calibration Dataset Mixing} strategy that adaptively adjusts the composition of the calibration set during pruning, rather than relying on a fixed calibration dataset.
Extensive experiments on a range of benchmarks show that our method can prune 50\% of the experts in a large-scale MoE model (\eg DeepSeek-R1) while retaining 98.9\% of its original performance across diverse tasks, outperforming existing pruning baselines. Our approach also delivers practical speedups and a reduced memory footprint, facilitating efficient real-world deployment.
The anonymous implementation is available at \url{https://anonymous.4open.science/r/DCDM-4C65-622a2bad88498795b8d7a92d85aca1315f9520ee}.
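The complexity gap stated in the abstract can be made concrete with a brief worked comparison. This is an illustrative sketch, not taken from the paper: it assumes the $\binom{2n}{n}$ term counts the ways of keeping $n$ of $2n$ experts, matching the 50\% pruning setting, and the value of $n$ below is chosen only as an example.
% Illustrative only; assumes selecting n of 2n experts (50% pruning).
\[
  \binom{2n}{n} \;\sim\; \frac{4^{n}}{\sqrt{\pi n}} \quad \text{(Stirling's approximation)},
\]
which grows exponentially in $n$, whereas $n^{1.5}$ grows polynomially. For example, with $n = 128$ (keeping half of $2n = 256$ experts),
\[
  \binom{256}{128} \approx 5.8 \times 10^{75},
  \qquad
  n^{1.5} = 128^{1.5} \approx 1.4 \times 10^{3},
\]
so exhaustive evaluation of expert combinations is far beyond reach, while a polynomial-scale search remains tractable.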
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15245