Keywords: reinforcement learning, Mixture-of-Experts
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a powerful tool for improving reasoning and code generation, yet training Mixture-of-Experts (MoE) policies remains fragile and can suffer from reward collapse.
We diagnose an MoE-specific instability mechanism: router shift (RS) across policy updates amplifies off-policy mismatch, yielding increasingly volatile importance-ratio signals and bursty clipping activity that precede collapse.
Motivated by this diagnosis, we propose \textbf{R}outer-\textbf{S}hift \textbf{P}olicy \textbf{O}ptimization (RSPO), which computes a per-token router-shift ratio on the old activated experts, applies stop-gradient and a lower-bound floor, and softly rescales importance ratios before clipping and aggregation.
In a small-scale Qwen2.5-MoE Countdown setting, we show that router-shift weighting acts as a plug-in stabilization module for GRPO, GSPO, and GMPO, improving both training stability and final reward.
On Qwen3-30B-A3B, RSPO (GMPO+RS) improves Pass@1 on both math and code benchmarks and stabilizes routing- and optimization-side diagnostics compared to GRPO.
Overall, our findings establish router-aware trust weighting as a practical design principle for building more stable and effective off-policy RL training pipelines for large MoE models.
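The per-token recipe summarized in the abstract can be sketched roughly as follows. This is an illustrative reading, not the paper's specification: the function name, the exact form of the router-shift ratio (taken here as the new-to-old router mass on the old activated experts), the floor value, and the cap at 1 are all assumptions.

```python
import numpy as np

def rspo_weighted_ratios(logp_new, logp_old, router_new, router_old,
                         old_expert_idx, floor=0.5, clip_eps=0.2):
    """Illustrative router-shift weighting (names and defaults assumed).

    logp_new, logp_old : (T,) per-token log-probs under new/old policy.
    router_new, router_old : (T, E) router probabilities over E experts.
    old_expert_idx : (T, K) experts activated under the OLD policy.
    """
    T = logp_new.shape[0]
    rows = np.arange(T)[:, None]
    # Router-shift ratio on the old activated experts: one plausible
    # form is the new-to-old ratio of router mass on those experts.
    mass_new = router_new[rows, old_expert_idx].sum(axis=1)
    mass_old = router_old[rows, old_expert_idx].sum(axis=1)
    rs = mass_new / np.maximum(mass_old, 1e-8)
    # Stop-gradient: in an autograd framework rs would be detached;
    # here it is a plain array, so no gradient flows through it anyway.
    # Lower-bound floor, plus an (assumed) cap at 1 so the weight only
    # softly down-weights tokens whose routing has shifted.
    w = np.clip(rs, floor, 1.0)
    # Softly rescale importance ratios before PPO-style clipping.
    ratio = np.exp(logp_new - logp_old) * w
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return ratio, clipped
```

In an actual training loop the rescaled `ratio` would replace the raw importance ratio inside the surrogate objective of GRPO/GSPO/GMPO; here the clipping step is shown explicitly only to illustrate where the rescaling sits.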
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Machine Learning for NLP; Language Modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 9668