Keywords: Mixture of Experts, Efficient Inference
Abstract: Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. Prior post-training optimizations, such as inter- and intra-expert pruning, reduce memory usage but provide limited gains in inference-time compute efficiency. Moreover, existing MoE architectures typically activate the same fixed number of experts in every layer, resulting in suboptimal performance. In this work, we first demonstrate that MoE pruning reduces the memory footprint but does not significantly improve inference performance. To address this, we introduce \textbf{LExI}, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE. LExI uses only the model's weights to estimate the relative importance of each layer and assigns the number of active experts per layer accordingly. Experiments on several MoEs demonstrate that LExI significantly outperforms traditional MoE pruning approaches in inference efficiency with negligible accuracy loss. For example, with LExI, Qwen1.5-MoE matches the throughput of traditional expert pruning on an NVIDIA H100 GPU while achieving 10\% better accuracy.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 164