Keywords: Sparse Mixture-of-Experts, Pruning, Compression
TL;DR: STEP prunes MoE models by using important tokens as guidance and converts pruned experts into adaptive bias vectors, ensuring minimal performance degradation while significantly improving efficiency.
Abstract: Mixture-of-Experts (MoE) architectures achieve exceptional scalability for large language models but present significant deployment challenges due to substantial expert parameter overhead. Existing expert pruning approaches rely on token-agnostic heuristics, such as routing frequency or similar statistical metrics. These methods dilute critical signals from important tokens, conflate statistical presence with functional importance, and completely discard pruned experts' knowledge. To address these limitations, we introduce ***STEP*** (Selective Token-guided Expert Pruning), a novel compression framework driven by three key innovations: (i) **Token-aware expert evaluation**, which prioritizes informative tokens for context-sensitive expert assessment; (ii) **Loss-impact scoring**, which quantifies expert importance through direct loss contribution rather than statistical proxy metrics; (iii) **Expert-to-bias conversion**, which preserves domain knowledge via compact adaptive vectors, transforming pruning from a "discard-and-forget" into a "compress-and-preserve" paradigm. Extensive experiments demonstrate ***STEP***'s superiority across model scales and MoE architectures. At 50\% expert sparsity on the 30B Qwen model, our pruning method reduces memory usage by nearly 50\% with minimal performance degradation. It also delivers a 1.5$\times$ throughput speedup, and the entire pruning-and-conversion process completes within 10 minutes. This enables efficient and scalable deployment of MoE models.
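The expert-to-bias conversion described above can be illustrated with a minimal sketch: one plausible instantiation replaces a pruned expert with the mean of its outputs over calibration tokens, so a compact bias vector retains a coarse summary of the expert's function. The function name `expert_to_bias`, the toy linear expert, and the mean-output formulation are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def expert_to_bias(expert_fn, calib_tokens):
    """Compress a pruned expert into a single bias vector.

    Assumption for illustration: the bias is the expert's mean
    output over calibration tokens; STEP's adaptive vectors may
    be derived differently.
    """
    outputs = np.stack([expert_fn(t) for t in calib_tokens])
    return outputs.mean(axis=0)

# Toy stand-in for an MoE FFN expert: a fixed linear map.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
expert = lambda x: x @ W

calib = [rng.normal(size=8) for _ in range(64)]
bias = expert_to_bias(expert, calib)
print(bias.shape)
```

At inference time, tokens that would have been routed to the pruned expert would simply receive this bias vector instead of a full FFN forward pass, which is where the memory and throughput savings come from.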
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7474