Keywords: On-Policy RL, Asynchronous RL, Policy Lag, RLHF
TL;DR: We propose a new policy gradient algorithm that uses the total variation (TV) divergence to improve the robustness of training under increased asynchronicity.
Abstract: Distributed training and increasing the gradient update frequency are practical strategies for accelerating learning and improving performance, but both exacerbate a central challenge: _policy lag_, the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use these findings to propose _total Variation-based Advantage aligned Constrained policy Optimization (VACO)_ as a practical approach to mitigating policy lag. We empirically validate our method and show that it offers better robustness to policy lag on classic robotics RL tasks and on a modern RL-for-LLM math-reasoning task.
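The abstract's central quantity, policy lag, can be made concrete with a small sketch. The snippet below is purely illustrative and not the authors' method: it measures the TV divergence between a stale behavior policy and the current learner policy over a discrete action space, and gates an update when the lag exceeds a threshold (the distributions and the threshold value are hypothetical).

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions:
    TV(p, q) = 0.5 * sum_a |p(a) - q(a)|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Illustrative action distributions (hypothetical values):
behavior = np.array([0.5, 0.3, 0.2])  # policy that generated the rollout
learner = np.array([0.3, 0.4, 0.3])   # policy after several async updates

lag = tv_distance(behavior, learner)  # here 0.5 * (0.2 + 0.1 + 0.1) = 0.2

# Gate the gradient update on the measured lag (threshold is arbitrary here)
threshold = 0.25
use_update = lag <= threshold
```

In a real asynchronous setup, `lag` would grow with the number of learner updates applied since the rollout was collected; a TV-based constraint of this flavor is one way to keep updates close to on-policy.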
Primary Area: reinforcement learning
Submission Number: 10141