Robust Policy Gradient Optimization through Parameter Perturbation in Reinforcement Learning

Published: 23 Sept 2025, Last Modified: 01 Dec 2025 (ARLET)
License: CC BY 4.0
Track: Research Track
Keywords: Policy Optimization, Policy Gradient Methods, Implicit Regularization, Reinforcement Learning
Abstract: Policy gradient methods have achieved strong performance in reinforcement learning, yet remain vulnerable to premature convergence and poor generalization, especially in on-policy settings where exploration is limited. Existing solutions typically rely on entropy regularization or action noise, but these approaches either require sensitive hyperparameter tuning or affect the interaction dynamics rather than the optimization process itself. In this paper, we propose Robust Policy Optimization (RPO), a policy gradient method that introduces perturbations to the policy parameters only during optimization. This technique smooths the loss landscape and implicitly regularizes learning, reducing sensitivity to local irregularities without altering policy behavior during data collection. We provide a theoretical analysis showing that RPO adds a Hessian-based regularization term to the objective, biasing updates toward flatter, more robust solutions. RPO significantly improves upon PPO and entropy-regularized variants across diverse continuous control benchmarks, achieving faster convergence, higher returns, and greater robustness—without the need for entropy tuning.
Submission Number: 100
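As described in the abstract, RPO perturbs the policy parameters only during the optimization step, so the gradient is evaluated at a noisy point in parameter space while data collection and the applied update use the unperturbed policy. Below is a minimal sketch of that idea in PyTorch; the function name, the `sigma` noise scale, and the `loss_fn`/`batch` interface are illustrative assumptions, not the authors' released implementation.

```python
import torch

def perturbed_policy_update(policy, optimizer, loss_fn, batch, sigma=0.01):
    """Sketch of a parameter-perturbed policy-gradient step (assumed interface).

    Gaussian noise is added to the policy parameters, the surrogate loss and its
    gradient are computed at the perturbed point, the noise is removed, and the
    resulting gradient is applied to the original parameters.
    """
    # Temporarily perturb the parameters in place, remembering the noise.
    noises = []
    with torch.no_grad():
        for p in policy.parameters():
            noise = sigma * torch.randn_like(p)
            p.add_(noise)
            noises.append(noise)

    # Gradient of the surrogate objective evaluated at the perturbed parameters.
    optimizer.zero_grad()
    loss = loss_fn(policy, batch)
    loss.backward()

    # Undo the perturbation so the optimizer step updates the clean parameters.
    with torch.no_grad():
        for p, noise in zip(policy.parameters(), noises):
            p.sub_(noise)

    optimizer.step()
    return loss.item()
```

Because the gradient is taken at a randomly shifted point, in expectation the update contains a curvature-dependent correction (the Hessian-based term discussed in the abstract), which is what biases learning toward flatter solutions without changing how the policy acts during rollouts.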