REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: mixture-of-experts, moe, compression, expert pruning, expert merging, merging, pruning, LLM, evaluation
TL;DR: This paper argues that pruning experts is superior to merging them for compressing MoE LLMs and introduces a new method, REAP, that achieves nearly lossless performance on generative tasks by preserving the model's functional structure.
Abstract: Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency, but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert *merging* on discriminative benchmarks, we find that expert *pruning* is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
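The abstract describes the REAP criterion only informally; the sketch below shows what a router-weighted expert-activation saliency score of this kind could look like, assuming per-expert gate-values and expert outputs have already been collected on a calibration set. The function names `reap_scores` and `experts_to_prune` are illustrative, not the paper's implementation.

```python
import torch


@torch.no_grad()
def reap_scores(router_gates, expert_outputs):
    """Hypothetical sketch of a router-weighted activation-norm saliency score.

    router_gates:   dict expert_id -> tensor of shape (n_tokens,), the gate-values
                    for the calibration tokens routed to that expert.
    expert_outputs: dict expert_id -> tensor of shape (n_tokens, d_model), the
                    corresponding expert output activations.

    Experts that are rarely selected *and* contribute little to the layer output
    receive low saliency and become pruning candidates first.
    """
    scores = {}
    for j, gates in router_gates.items():
        outs = expert_outputs[j]
        if gates.numel() == 0:
            # Expert never activated on the calibration data.
            scores[j] = 0.0
            continue
        # Average of gate-value times expert output norm over routed tokens.
        scores[j] = (gates * outs.norm(dim=-1)).mean().item()
    return scores


def experts_to_prune(scores, compression_ratio=0.5):
    """Return the expert ids with the lowest saliency at the target compression ratio."""
    ranked = sorted(scores, key=scores.get)
    return ranked[: int(len(ranked) * compression_ratio)]
```

In this sketch the lowest-scoring experts are simply dropped (one-shot, no retraining), which matches the paper's framing of pruning as removing experts outright rather than merging their weights.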
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13701