Keywords: Reinforcement Learning; Mixture of Experts
Abstract: Scaling model size has been a key driver of progress in supervised learning, but it remains a challenge in deep reinforcement learning (RL), where naively increasing the parameters of actor-critic networks often leads to instability and performance degradation. While recent architectures like SimBa and BRC have shown that careful inductive biases can enable positive scaling in continuous control, they remain monolithic, activating all parameters for every input. In this work, we introduce \textbf{ScaleMoE}, an architecture that integrates Mixture-of-Experts (MoE) modules into both the actor and the critic of state-of-the-art continuous control algorithms, effectively turning parameter growth into consistent performance gains. We propose two integration strategies: (i) \emph{output-level gating}, where a learned gating network selects the top-$K$ expert actors and critics per state and merges their outputs (policy means, variances, and $Q$-values) via the gating weights; and (ii) \emph{feature-level gating}, where experts produce penultimate features that are combined by top-$K$ gating and passed through a shared output layer for both policy and value predictions. We implement ScaleMoE on a single-task actor–critic baseline (SimBa) and a multi-task baseline (BRC), two representative monolithic-scaling RL methods. Experiments on the DeepMind Control Suite and HumanoidBench demonstrate improved returns as the number of experts increases. In multi-task settings, ScaleMoE with smaller experts matches or outperforms a larger monolithic network with substantially fewer parameters. Our findings indicate that MoE offers an effective and compute-efficient scaling axis for deep RL in continuous control, narrowing the gap with supervised learning.
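For intuition, the sketch below illustrates the two gating strategies the abstract describes, in PyTorch. This is a minimal illustration under our own assumptions, not the authors' implementation: all names (`TopKGate`, `OutputLevelMoEActor`, `FeatureLevelMoEActor`, `num_experts`, `top_k`, `hidden`) are hypothetical, and for simplicity every expert is evaluated densely before the top-$K$ selection, whereas an efficient sparse MoE would dispatch only the selected experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKGate(nn.Module):
    """Learned gate: scores every expert per state, keeps the top-K,
    and renormalizes the softmax over only the selected experts."""

    def __init__(self, state_dim, num_experts, top_k):
        super().__init__()
        self.scorer = nn.Linear(state_dim, num_experts)
        self.top_k = top_k

    def forward(self, state):
        scores = self.scorer(state)                        # (B, E)
        topk_vals, topk_idx = scores.topk(self.top_k, -1)  # (B, K)
        weights = F.softmax(topk_vals, dim=-1)             # convex weights over K experts
        return weights, topk_idx


class OutputLevelMoEActor(nn.Module):
    """Output-level gating: each expert is a full actor head; the gate
    merges the experts' policy parameters (here, mean and log-std)."""

    def __init__(self, state_dim, action_dim, num_experts=8, top_k=2, hidden=256):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 2 * action_dim))
            for _ in range(num_experts)
        )
        self.gate = TopKGate(state_dim, num_experts, top_k)

    def forward(self, state):
        weights, idx = self.gate(state)                                   # (B, K)
        outs = torch.stack([e(state) for e in self.experts], dim=1)      # (B, E, 2A)
        chosen = outs.gather(1, idx.unsqueeze(-1).expand(-1, -1, outs.size(-1)))
        merged = (weights.unsqueeze(-1) * chosen).sum(dim=1)             # (B, 2A)
        mean, log_std = merged.chunk(2, dim=-1)
        return mean, log_std


class FeatureLevelMoEActor(nn.Module):
    """Feature-level gating: experts emit penultimate features; the gated
    mixture is passed through a single shared output layer."""

    def __init__(self, state_dim, action_dim, num_experts=8, top_k=2, hidden=256):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU())
            for _ in range(num_experts)
        )
        self.gate = TopKGate(state_dim, num_experts, top_k)
        self.head = nn.Linear(hidden, 2 * action_dim)  # shared output layer

    def forward(self, state):
        weights, idx = self.gate(state)
        feats = torch.stack([e(state) for e in self.experts], dim=1)     # (B, E, H)
        chosen = feats.gather(1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        mixed = (weights.unsqueeze(-1) * chosen).sum(dim=1)              # (B, H)
        mean, log_std = self.head(mixed).chunk(2, dim=-1)
        return mean, log_std


# Usage with arbitrary example dimensions:
actor = OutputLevelMoEActor(state_dim=67, action_dim=21)
mean, log_std = actor(torch.randn(32, 67))  # batch of 32 states
```

The same top-$K$ gate would merge expert critics' $Q$-values in the output-level variant; renormalizing the softmax over only the selected experts keeps every merged output a convex combination of expert predictions.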
Primary Area: reinforcement learning
Submission Number: 7537