REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

Published: 25 Jan 2026, Last Modified: 06 Mar 2026 · CPAL 2026 (Recent Spotlight Track) Poster · CC BY 4.0
Keywords: mixture-of-experts, moe, compression, expert pruning, expert merging, merging, pruning, LLM, evaluation
TL;DR: We argue that pruning experts is superior to merging them for one-shot compression of MoE LLMs and introduce a new method, REAP, that achieves nearly lossless performance on generative tasks by minimizing the upper bound of the reconstruction error.
Abstract: Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency, but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert *merging* on discriminative benchmarks, we find that expert *pruning* is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
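To make the criterion concrete, here is a minimal sketch of how a router-weighted activation score might be computed from calibration data. This is an illustrative interpretation of the abstract only, not the authors' implementation: the function names (`reap_saliency`, `prune_experts`) and the exact form of the score (mean over tokens of gate value times activation norm) are assumptions.

```python
import numpy as np

def reap_saliency(gate_values, expert_outputs):
    """Score each expert by its router-weighted activation magnitude.

    gate_values:    (num_tokens, num_experts) router gate weights per token
    expert_outputs: (num_tokens, num_experts, d_model) expert activations

    Assumed scoring rule (illustrative): average over tokens of
    gate value * L2 norm of the expert's output.
    """
    norms = np.linalg.norm(expert_outputs, axis=-1)  # (num_tokens, num_experts)
    return (gate_values * norms).mean(axis=0)        # (num_experts,)

def prune_experts(gate_values, expert_outputs, keep_ratio=0.5):
    """Return the indices of the experts to keep (highest saliency)."""
    scores = reap_saliency(gate_values, expert_outputs)
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[-k:])
```

Under this reading, an expert that is rarely routed to (small gate values) or whose outputs are small in norm contributes little to the layer's output, so removing it keeps the reconstruction error bound small.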
Submission Number: 29