Keywords: reinforcement learning, Mixture-of-Experts
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a powerful tool for improving reasoning and code generation, yet training Mixture-of-Experts (MoE) policies remains fragile and can suffer from reward collapse.
We diagnose an MoE-specific instability mechanism: router shift (RS) across policy updates amplifies off-policy mismatch, yielding increasingly volatile importance-ratio signals and bursty clipping activity that precede collapse.
Motivated by this diagnosis, we propose \textbf{R}outer-\textbf{S}hift \textbf{P}olicy \textbf{O}ptimization (RSPO), which computes a per-token router-shift ratio on the old activated experts, applies stop-gradient and a lower-bound floor, and softly rescales importance ratios before clipping and aggregation.
In a small-scale Qwen2.5-MoE Countdown setting, we show that router-shift weighting acts as a plug-in stabilization module for GRPO, GSPO, and GMPO, improving both training stability and final reward.
On Qwen3-30B-A3B, RSPO (GMPO+RS) improves Pass@1 on both math and code benchmarks and stabilizes routing- and optimization-side diagnostics compared to GRPO.
Overall, our findings establish router-aware trust weighting as a practical design principle for building more stable and effective off-policy RL training pipelines for large MoE models.
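The per-token recipe summarized in the abstract can be sketched roughly as follows. This is an illustrative reading, not the paper's specification: the function name, the exact form of the router-shift ratio (taken here as the new-to-old router mass on the old activated experts), the floor value, and the cap at 1 are all assumptions.

```python
import numpy as np

def rspo_weighted_ratios(logp_new, logp_old, router_new, router_old,
                         old_expert_idx, floor=0.5, clip_eps=0.2):
    """Illustrative router-shift weighting (names and defaults assumed).

    logp_new, logp_old : (T,) per-token log-probs under new/old policy.
    router_new, router_old : (T, E) router probabilities over E experts.
    old_expert_idx : (T, K) experts activated under the OLD policy.
    """
    T = logp_new.shape[0]
    rows = np.arange(T)[:, None]
    # Router-shift ratio on the old activated experts: one plausible
    # form is the new-to-old ratio of router mass on those experts.
    mass_new = router_new[rows, old_expert_idx].sum(axis=1)
    mass_old = router_old[rows, old_expert_idx].sum(axis=1)
    rs = mass_new / np.maximum(mass_old, 1e-8)
    # Stop-gradient: in an autograd framework rs would be detached;
    # here it is a plain array, so no gradient flows through it anyway.
    # Lower-bound floor, plus an (assumed) cap at 1 so the weight only
    # softly down-weights tokens whose routing has shifted.
    w = np.clip(rs, floor, 1.0)
    # Softly rescale importance ratios before PPO-style clipping.
    ratio = np.exp(logp_new - logp_old) * w
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return ratio, clipped
```

In an actual training loop the rescaled `ratio` would replace the raw importance ratio inside the surrogate objective of GRPO/GSPO/GMPO; here the clipping step is shown explicitly only to illustrate where the rescaling sits.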
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Machine Learning for NLP; Language Modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 9668