Abstract: Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and expert redundancy. Pruning MoE models can reduce the number of network weights while maintaining model performance. Motivated by recent observations of emergent large-magnitude features in Large Language Models (LLMs) and of the MoE routing policy, we propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights. Our pruning method is one-shot, requiring no retraining or weight updates. Furthermore, our pruned MoE models can benefit from a pretrained teacher model through expert-wise knowledge distillation, improving performance after pruning. We evaluate our method on various MoE models, such as Mixtral and DeepSeek, across multiple zero-shot evaluation benchmarks. Experimental results demonstrate that our pruning method significantly outperforms state-of-the-art LLM pruning methods. After expert-wise knowledge distillation, the pruned model with 50% sparsity maintains 99% of the original model's performance.
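A minimal sketch of the pruning criterion described in the abstract (weight magnitude multiplied by input activations and router weights), assuming a Wanda-style per-row comparison and illustrative names (`prune_expert_weights`, `router_scores`) that are not taken from the paper:

```python
import torch

def prune_expert_weights(W, X, router_scores, sparsity=0.5):
    """One-shot pruning sketch for a single expert's weight matrix.

    W             : (out_dim, in_dim) expert weight matrix
    X             : (n_tokens, in_dim) input activations routed to this expert
    router_scores : (n_tokens,) router gate weights assigned to this expert
    """
    # Scale each token's activation by its router weight, then take the
    # per-input-channel L2 norm as a proxy for input importance.
    scaled = X * router_scores.unsqueeze(1)       # (n_tokens, in_dim)
    act_norm = scaled.norm(p=2, dim=0)            # (in_dim,)

    # Pruning metric: |weight| * activation importance, broadcast over rows.
    metric = W.abs() * act_norm.unsqueeze(0)      # (out_dim, in_dim)

    # Zero out the smallest-metric weights within each output row.
    k = int(W.shape[1] * sparsity)
    _, idx = torch.topk(metric, k, dim=1, largest=False)
    mask = torch.ones_like(W, dtype=torch.bool)
    mask.scatter_(1, idx, False)
    return W * mask
```

This is only an interpretation of the criterion stated in the abstract, not the authors' released implementation; the actual method may differ in how router weights and activation statistics are aggregated.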
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Mixture-of-Experts, MoE, Router, Efficiency, Pruning, Sparsity, Acceleration
Contribution Types: Approaches to low-resource settings, Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 1414