Keywords: Reinforcement learning for language models, Policy optimization, Efficiency
Abstract: Reinforcement learning (RL) has become central to post-training large language models (LLMs).
However, popular RL methods like GRPO incur non-negligible overhead by computing both old-policy and current-policy likelihoods to form importance sampling ratios.
In this work, we propose Likelihood-Gated Policy Optimization (LGPO), which enforces a soft trust region constraint via likelihood-based gating, eliminating the need to compute old-policy likelihoods.
Empirically, we show that removing the importance sampling correction term does not harm training stability, whereas removing the trust region mechanism leads to collapse.
Moreover, ratio-based clipping can fail in fully on-policy training: the importance ratio is identically 1, so the trust region constraint never activates.
Under standard training settings where GRPO is stable, LGPO achieves comparable training stability and peak performance while reducing training time by ~18% on average.
In fully on-policy training, where GRPO fails, LGPO remains stable, enabling more efficient and robust LLM RL post-training across training regimes.
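To make the contrast in the abstract concrete, the sketch below compares a ratio-based clipped objective (as in GRPO, which requires old-policy log-likelihoods) with a likelihood-gated objective that uses only current-policy log-likelihoods. This is an illustrative reconstruction, not the paper's implementation: the gating thresholds `low` and `high`, the function names, and the per-token loop are all assumptions introduced here for exposition.

```python
import math

def grpo_style_loss(logp_new, logp_old, adv, eps=0.2):
    # Ratio-based trust region: requires a second forward pass to get
    # logp_old, which is the overhead the abstract refers to.
    losses = []
    for ln, lo, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(ln - lo)                       # importance ratio
        clipped = min(max(ratio, 1 - eps), 1 + eps)     # clip to [1-eps, 1+eps]
        losses.append(-min(ratio * a, clipped * a))     # pessimistic PPO-style term
    return sum(losses) / len(losses)

def lgpo_style_loss(logp_new, adv, low=-5.0, high=-0.1):
    # Likelihood-based gating (thresholds are illustrative assumptions):
    # zero out tokens whose current-policy log-likelihood falls outside a
    # trusted band -- no old-policy likelihoods are needed at all.
    losses = []
    for ln, a in zip(logp_new, adv):
        gate = 1.0 if low < ln < high else 0.0
        losses.append(-(gate * ln * a))
    return sum(losses) / len(losses)
```

Note how the sketch also exhibits the on-policy failure mode described above: when `logp_new == logp_old`, every ratio equals 1 and the clipping bounds can never bind, so the ratio-based trust region is inert, while the likelihood gate still constrains the update.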
Submission Number: 46