Keywords: reinforcement learning, on-policy, policy gradient, data collection
TL;DR: We introduce a non-i.i.d., off-policy sampling method to produce data that more closely matches the expected on-policy data distribution than on-policy sampling can produce, thus improving the data efficiency of on-policy policy gradient algorithms.
Abstract: On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent's current policy.
However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution.
This \textit{sampling error} leads to high-variance gradient estimates and data-inefficient on-policy learning.
Recent work in policy evaluation has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error w.r.t. the expected on-policy distribution than on-policy sampling can produce.
Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training.
Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a \textit{behavior policy} that increases the probability of sampling actions that are under-sampled w.r.t. the current policy.
We empirically evaluate PROPS on MuJoCo benchmark tasks
and demonstrate that PROPS (1) decreases sampling error throughout training and (2) increases the data efficiency of on-policy policy gradient algorithms.
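To illustrate the core idea described in the abstract, the sketch below shows, in a simplified discrete-action setting, how a behavior policy can upweight actions that are under-sampled relative to the current policy. This is only an assumption-based illustration: the function name, the `boost` parameter, and the single-state setup are hypothetical, and the actual PROPS method instead learns its behavior policy with a proximal update, which is not implemented here.

```python
import numpy as np

# Minimal sketch (single state, discrete actions; names are hypothetical).
# The behavior policy boosts actions that are under-sampled relative to the
# target (current) policy pi, then renormalizes.
def adjusted_behavior_probs(pi_probs, action_counts, boost=1.0):
    """Return behavior-policy probabilities that favor under-sampled actions.

    pi_probs:      target policy's action probabilities, shape (num_actions,)
    action_counts: how often each action has been sampled so far
    boost:         strength of the correction (0 recovers on-policy sampling)
    """
    total = max(action_counts.sum(), 1)
    empirical = action_counts / total                   # observed action distribution
    deficit = np.clip(pi_probs - empirical, 0.0, None)  # per-action under-sampling
    behavior = pi_probs + boost * deficit               # upweight under-sampled actions
    return behavior / behavior.sum()

# Usage: sample with the adjusted probabilities, update counts, repeat.
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])
counts = np.zeros_like(pi)
for _ in range(1000):
    b = adjusted_behavior_probs(pi, counts, boost=1.0)
    a = rng.choice(len(pi), p=b)
    counts[a] += 1
print(counts / counts.sum())  # empirical distribution tracks pi more closely than i.i.d. sampling from pi
```

The design point this captures is that the data is collected off-policy (from the adjusted behavior distribution) and non-i.i.d. (each draw depends on the counts so far), yet the aggregate dataset matches the expected on-policy distribution more closely than i.i.d. on-policy sampling would after the same number of samples.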
Submission Number: 26