A-3PO: Accelerating Asynchronous LLM Training with Staleness-aware Proximal Policy Approximation

Published: 03 Mar 2026, Last Modified: 25 Mar 2026. License: CC BY 4.0
Keywords: Reinforcement Learning, Policy Optimization, Large Language Models
TL;DR: A-3PO accelerates asynchronous RL-based LLM training by approximating the proximal policy with a simple interpolation, preserving provable stability while delivering a 1.8× training speedup without extra forward passes for the proximal policy.
Abstract: Decoupled PPO is a successful reinforcement learning (RL) algorithm for handling high data staleness in the asynchronous RL setting. Its decoupled loss improves the learning stability of coupled-loss algorithms (e.g., standard PPO, GRPO) by introducing a proximal policy that separates the off-policy correction (importance weight) from the policy update constraint (trust region). However, the proximal policy requires an extra forward pass through the model at each training step, creating a computational overhead for large language model training. We observe that since the proximal policy only serves as a trust region anchor between the behavior and target policies, we can approximate it through simple interpolation without explicit computation. We call this approach A-3PO (APproximated Proximal Policy Optimization). A-3PO eliminates this overhead, achieving a 1.8$\times$ training speedup while maintaining comparable performance. Code \& an off-the-shelf example are contributed anonymously to the open-source RL training system AReaL at the Anonymous Github: https://anonymous.4open.science/r/A-3PO/docs/algorithms/prox_approx.md. We will replace this link with the real GitHub repository once anonymity is no longer required.
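To make the core idea concrete, below is a minimal PyTorch sketch of a decoupled-PPO-style loss in which the proximal log-probabilities are approximated by interpolating between the behavior and target policies rather than computed with an extra forward pass. This is an illustrative assumption of how such an interpolation could look, not the authors' implementation; the function name `a3po_decoupled_loss` and the mixing coefficient `alpha` are hypothetical.

```python
import torch

def a3po_decoupled_loss(logp_target, logp_behav, advantages,
                        alpha=0.5, clip_eps=0.2):
    """Decoupled-PPO-style loss with an interpolated proximal policy.

    Sketch only (assumptions): the proximal policy's log-probabilities are
    approximated by a linear interpolation in log-prob space between the
    behavior policy and a detached copy of the current target policy, so no
    separate forward pass is required. `alpha` stands in for a
    staleness-aware mixing coefficient; the paper's exact scheme may differ.
    """
    # Approximated proximal log-probs (assumption: linear interpolation).
    logp_prox = alpha * logp_behav.detach() + (1.0 - alpha) * logp_target.detach()

    # Off-policy correction: importance weight of proximal vs. behavior policy.
    behav_weight = torch.exp(logp_prox - logp_behav).detach()

    # Trust-region ratio between the trainable target policy and the proximal anchor.
    ratio = torch.exp(logp_target - logp_prox)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # PPO-style clipped surrogate, scaled by the off-policy correction.
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -(behav_weight * surrogate).mean()
```

In a standard decoupled PPO step, `logp_prox` would instead come from an additional forward pass with the proximal policy's weights; the interpolation above is the point where that pass is skipped.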
Submission Number: 65