Keywords: On-Policy RL, Asynchronous RL, Policy Lag, RLHF
TL;DR: We propose a new policy gradient algorithm that uses the total variation (TV) divergence to improve the robustness of training under increased asynchronicity.
Abstract: Distributed training and increasing the gradient update frequency are practical strategies for accelerating learning and improving performance, but both exacerbate a central challenge: _policy lag_, the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use these findings to propose _total Variation-based Advantage aligned Constrained policy Optimization (VACO)_ as a practical approach to mitigating policy lag. We empirically validate our method and show that it offers better robustness to policy lag on classic robotics RL tasks and on a modern RL-for-LLM math-reasoning task.
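The abstract's central quantity, policy lag, can be made concrete with a small sketch. The snippet below is purely illustrative and not the authors' method: it measures the TV divergence between a stale behavior policy and the current learner policy over a discrete action space, and gates an update when the lag exceeds a threshold (the distributions and the threshold value are hypothetical).

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions:
    TV(p, q) = 0.5 * sum_a |p(a) - q(a)|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Illustrative action distributions (hypothetical values):
behavior = np.array([0.5, 0.3, 0.2])  # policy that generated the rollout
learner = np.array([0.3, 0.4, 0.3])   # policy after several async updates

lag = tv_distance(behavior, learner)  # here 0.5 * (0.2 + 0.1 + 0.1) = 0.2

# Gate the gradient update on the measured lag (threshold is arbitrary here)
threshold = 0.25
use_update = lag <= threshold
```

In a real asynchronous setup, `lag` would grow with the number of learner updates applied since the rollout was collected; a TV-based constraint of this flavor is one way to keep updates close to on-policy.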
Primary Area: reinforcement learning
Submission Number: 10141