Input-Aware Expert Pruning for Efficient MoE Deployment

ACL ARR 2025 February Submission 1647 Authors

14 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Mixture-of-Experts (MoE) models, a primary method for scaling parameters to boost the performance of large language models (LLMs), require substantial memory when deployed in downstream systems. To mitigate this, existing methods often prune or compress parameters before inference to reduce memory usage. However, such static optimizations conflict with the MoE design philosophy: expert activation is input-dependent. To resolve this issue, we introduce in**P**ut-awa**R**e **E**xpert **P**runing (PREP), a method that dynamically identifies and retains only the most critical experts for each input, substantially lowering memory overhead while preserving model performance. Specifically, after deriving expert importance, PREP fits a lightweight, input-dependent linear approximation of expert importance through an efficient search on the CPU. Combined with a hardware-optimized mechanism that loads experts layer by layer, PREP reduces memory usage to as little as **37.5%** of the base model. Experiments across diverse benchmarks demonstrate that our method outperforms prior compression techniques in accuracy while achieving the lowest inference latency. Code for reproducibility is available at https://anonymous.4open.science/r/PREP-5375.
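The sketch below illustrates the general idea described in the abstract, not the authors' implementation: a lightweight linear proxy scores experts for a given input, and only the top-scoring experts are kept (and later loaded layer by layer). All names (`LinearImportanceProxy`, `select_experts`, `keep_ratio`) and the choice of `keep_ratio=0.375` are illustrative assumptions, not details from the paper.

```python
# Minimal sketch, assuming a linear per-layer importance proxy run on CPU.
import torch
import torch.nn as nn

class LinearImportanceProxy(nn.Module):
    """Hypothetical linear map from an input summary vector to per-expert
    importance scores for one MoE layer."""
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, pooled_input: torch.Tensor) -> torch.Tensor:
        # pooled_input: (hidden_dim,) summary of the current input,
        # e.g. the mean of the prompt token embeddings.
        return self.proj(pooled_input)  # (num_experts,) importance scores

def select_experts(proxy: LinearImportanceProxy,
                   pooled_input: torch.Tensor,
                   keep_ratio: float = 0.375) -> torch.Tensor:
    """Keep only the highest-scoring experts for this input; experts that are
    not selected need not be loaded into GPU memory for this request."""
    scores = proxy(pooled_input)
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices

# Usage: score experts on CPU before loading the selected ones layer by layer.
proxy = LinearImportanceProxy(hidden_dim=4096, num_experts=8)
kept = select_experts(proxy, torch.randn(4096))
print(kept)  # indices of experts to load for this input
```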
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning, NLP in resource-constrained settings
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 1647