$\mu$P for RL: Mitigating Feature Inconsistencies During Reinforcement Learning

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, rich feature learning, compute efficiency
TL;DR: We propose using the maximal update parameterization to reduce the feature and policy inconsistencies encountered across model sizes when training reinforcement learning agents.
Abstract: The maximal update parameterization ($\mu$P) has been influential in supervised and unsupervised learning, where data distributions are fixed, owing to its ability to maintain feature learning at larger parameter scales. This yields more consistent learning dynamics and learned features across model sizes. In addition, optimal hyperparameters such as the learning rate transfer approximately from small to larger models, minimizing the computational overhead of hyperparameter sweeps. However, it remains unclear whether these benefits carry over to the reinforcement learning setting, where the model's learning dynamics are coupled to a shifting data distribution: reinforcement learning agents must continually adapt to non-stationary data throughout training. We empirically study reinforcement learning agents trained under two regimes, the "rich" CompleteP parameterization and the "lazy" Neural Tangent Kernel (NTK) parameterization, and examine how each affects hyperparameter transfer as well as feature and policy consistency. Ultimately, we show that agents trained with the CompleteP parameterization achieve better compute and reward efficiency than NTK-parameterized agents over 16 continuous control tasks and variants, e.g., normalization and sparse rewards. Hence, we argue that adopting the CompleteP parameterization minimizes learning inconsistencies across model sizes and improves compute efficiency when scaling up.
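To make the width-dependent scaling behind $\mu$P concrete, below is a minimal, hedged sketch of one common Adam-based $\mu$P recipe applied to a small MLP (as might back a policy or value network). The names (`BASE_WIDTH`, `width_mult`) and the exact scaling rules are illustrative assumptions, not the paper's implementation or the CompleteP variant it studies.

```python
# Illustrative muP-style scaling for a 2-hidden-layer MLP (assumed recipe, not the
# paper's code): hidden/output Adam learning rates shrink as 1/width_mult, and the
# output layer gets an extra 1/sqrt(width_mult) shrinkage at initialization so its
# outputs stay O(1) as width grows.
import torch
import torch.nn as nn

BASE_WIDTH = 64  # hypothetical base (proxy) width used to define the width multiplier


def build_mup_mlp(in_dim: int, width: int, out_dim: int):
    """Build the layers of a 2-hidden-layer MLP with muP-style initialization."""
    m = width / BASE_WIDTH  # width multiplier relative to the small base model
    layers = nn.ModuleDict({
        "inp": nn.Linear(in_dim, width),
        "hid": nn.Linear(width, width),
        "out": nn.Linear(width, out_dim),
    })
    # Input/hidden layers: standard 1/sqrt(fan_in) initialization.
    nn.init.normal_(layers["inp"].weight, std=in_dim ** -0.5)
    nn.init.normal_(layers["hid"].weight, std=width ** -0.5)
    # Output layer: extra 1/sqrt(m) shrinkage relative to standard init.
    nn.init.normal_(layers["out"].weight, std=(width ** -0.5) / (m ** 0.5))
    for layer in layers.values():
        nn.init.zeros_(layer.bias)
    return layers, m


def forward(layers, x):
    """Plain ReLU MLP forward pass over the layer dict."""
    h = torch.relu(layers["inp"](x))
    h = torch.relu(layers["hid"](h))
    return layers["out"](h)


def mup_param_groups(layers, m, base_lr=3e-4):
    """Per-layer Adam learning rates: hidden/output LRs scale as 1/m under muP."""
    return [
        {"params": layers["inp"].parameters(), "lr": base_lr},
        {"params": layers["hid"].parameters(), "lr": base_lr / m},
        {"params": layers["out"].parameters(), "lr": base_lr / m},
    ]


# Usage: the same base_lr tuned at BASE_WIDTH is reused when scaling width up,
# which is the hyperparameter-transfer property the abstract refers to.
layers, m = build_mup_mlp(in_dim=17, width=1024, out_dim=6)
optimizer = torch.optim.Adam(mup_param_groups(layers, m))
```

The design point is that only the parameter grouping and initialization change with width; the base learning rate tuned on the small model is reused unchanged at larger widths.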
Primary Area: reinforcement learning
Submission Number: 14262