Keywords: MoE, LLM, Pruning
Abstract: Most deployed large language model applications benefit more from specialized models than from ever-larger generalists. While Mixture-of-Experts (MoE) models learn specialists and activate only a subset of experts per token, they typically retain far more experts than needed for any specific task. This inflates inference latency and memory usage without proportional performance gains.
We present LEAP (Learning Expert Adaptation and Pruning), a principled framework that decouples model structure from behavior through agentic optimization. Our approach uses a meta-reinforcement-learning Pruning Agent to search the combinatorial space of expert subsets, jointly optimizing performance and efficiency to identify compact, task-specific expert configurations. After pruning, we reconfigure the original router as a Routing Agent and train it with Proximal Policy Optimization (PPO). An active learning loop then selects the most informative, high-uncertainty samples to accelerate model recovery and specialization.
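The abstract does not include code; the following is a minimal illustrative sketch of the kind of objective the Pruning Agent might optimize, i.e., task quality traded against the number of retained experts. All names (`task_score`, `alpha`, the random-search loop) are hypothetical stand-ins, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): score candidate expert
# subsets with a reward that balances task quality against compute cost.
import random

NUM_EXPERTS = 128      # experts per MoE layer in the full model
TARGET_ACTIVE = 16     # 8x fewer activated experts, as reported in the abstract


def task_score(expert_subset: frozenset) -> float:
    """Placeholder for held-out task quality of the model pruned to this subset."""
    random.seed(hash(expert_subset) % (2 ** 32))
    return random.uniform(0.8, 1.0)


def reward(expert_subset: frozenset, alpha: float = 0.5) -> float:
    """Task quality minus an efficiency penalty on the number of retained experts."""
    efficiency_cost = len(expert_subset) / NUM_EXPERTS
    return task_score(expert_subset) - alpha * efficiency_cost


# Naive random search standing in for the meta-RL policy: sample candidate
# subsets of the target size and keep the best-scoring one.
candidates = (
    frozenset(random.sample(range(NUM_EXPERTS), TARGET_ACTIVE)) for _ in range(64)
)
best = max(candidates, key=reward)
print(f"best subset size={len(best)}, reward={reward(best):.3f}")
```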
We evaluate LEAP on Llama 4 Maverick (17Bx128E) and Qwen3-235B-A22B across three diverse tasks: HumanEval (code generation), GSM8K (mathematical reasoning), and XSum (summarization). LEAP retains $>94\%$ of the original model quality while using $8\times$ fewer activated experts per token. This translates to up to $\mathbf{2.5\times}$ faster per-token inference, $0.31\times$ the FLOPs, and $\sim40\%$ lower peak memory usage compared to the full 128-expert models. Our method establishes a Pareto-dominant accuracy--compute frontier, consistently outperforming state-of-the-art (SoTA) techniques including frequency-based pruning, magnitude-based pruning, and vanilla fine-tuning.
Ablation studies demonstrate that learned pruning significantly outperforms heuristic methods, that active learning reduces labeled-data requirements by $2.1\times$, and that PPO-based routing is essential for maintaining post-pruning performance. By transforming expert selection and routing into a closed-loop, learnable process, LEAP provides a practical pathway to specialized, efficient MoE models and advances toward scalable, agentic optimization of expert systems.
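As a complement to the active-learning ablation, a minimal sketch of uncertainty-based sample selection is given below, using predictive entropy as the acquisition score. The function and variable names (`select_most_informative`, `budget`) are hypothetical and chosen for illustration; the paper's actual acquisition criterion may differ.

```python
# Illustrative sketch (hypothetical names): pick the highest-uncertainty
# unlabeled samples for the active-learning recovery step.
import math


def entropy(probs):
    """Shannon entropy of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_informative(pool, model_probs, budget):
    """Return the `budget` pool items whose predictions are most uncertain."""
    ranked = sorted(pool, key=lambda x: entropy(model_probs[x]), reverse=True)
    return ranked[:budget]


# Toy usage: three samples, select the single most uncertain one.
probs = {
    "s1": [0.95, 0.05],  # confident
    "s2": [0.50, 0.50],  # maximally uncertain
    "s3": [0.70, 0.30],
}
print(select_most_informative(list(probs), probs, budget=1))  # ['s2']
```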
Code: https://anonymous.4open.science/r/LEAP2-4668
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 13755