Robust Policy Gradient Optimization through Action Parameter Perturbation in Reinforcement Learning

ICLR 2026 Conference Submission 20796 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Policy Optimization, Policy Gradient Methods, Implicit Regularization, Reinforcement Learning
TL;DR: We propose RPO, a policy gradient method that introduces action parameter-space perturbations during optimization, implicitly regularizing the objective to enhance performance in on-policy reinforcement learning.
Abstract: Policy gradient methods have achieved strong performance in reinforcement learning, yet they remain vulnerable to premature convergence and poor generalization, especially in on-policy settings where exploration is limited. Existing solutions typically rely on entropy regularization or action noise, but these approaches either require sensitive hyperparameter tuning or alter the interaction dynamics rather than the optimization process itself. In this paper, we propose Robust Policy Optimization (RPO), a policy gradient method that introduces perturbations to the policy's action-distribution parameters only during optimization. This approach smooths the loss landscape and implicitly regularizes learning, reducing sensitivity to local irregularities while leaving policy behavior during data collection unchanged. We provide a theoretical perspective showing that RPO implicitly biases updates toward flatter and more stable solutions. Empirically, RPO significantly improves upon PPO and entropy-regularized variants across diverse continuous control benchmarks, achieving faster convergence, higher returns, and greater robustness.
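To make the core idea concrete, below is a minimal sketch of what an update-time action-parameter perturbation could look like, assuming a Gaussian policy and a PPO-style clipped surrogate; the function name `rpo_surrogate_loss`, the uniform perturbation range `alpha`, and all tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
from torch.distributions import Normal

def rpo_surrogate_loss(mean, log_std, actions, old_log_probs, advantages,
                       alpha=0.5, clip_eps=0.2):
    """PPO-style clipped surrogate loss with an update-time perturbation
    of the action-distribution mean (a sketch of the RPO idea).

    Noise is injected only here, when computing the loss; the unperturbed
    policy is still used for data collection, so interaction dynamics are
    unchanged.
    """
    # Perturb the Gaussian mean with uniform noise during optimization only.
    noise = torch.empty_like(mean).uniform_(-alpha, alpha)
    dist = Normal(mean + noise, log_std.exp())

    # Log-probabilities of the collected actions under the perturbed policy.
    new_log_probs = dist.log_prob(actions).sum(dim=-1)
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Standard PPO clipped objective (negated, since optimizers minimize).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # Toy usage with random data: batch of 4, 2-dimensional actions.
    mean = torch.zeros(4, 2, requires_grad=True)
    log_std = torch.zeros(4, 2)
    actions = torch.randn(4, 2)
    old_log_probs = Normal(torch.zeros(4, 2), 1.0).log_prob(actions).sum(dim=-1)
    advantages = torch.randn(4)
    loss = rpo_surrogate_loss(mean, log_std, actions, old_log_probs, advantages)
    loss.backward()
    print(loss.item())
```

Because the noise enters only through the loss computation, each gradient step is taken against a slightly randomized neighborhood of the current action parameters, which is the mechanism the abstract credits with smoothing the loss landscape.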
Primary Area: reinforcement learning
Submission Number: 20796