RL-finetuning LLMs from on- and off-policy data with a single algorithm

Published: 03 Feb 2026, Last Modified: 06 Feb 2026 · AISTATS 2026 Poster · CC BY 4.0
TL;DR: LLM RL finetuning with both on-policy and off-policy data
Abstract: We introduce a novel reinforcement learning algorithm (AGRO, for Any-Generation Reward Optimization) for finetuning Large Language Models. AGRO leverages the concept of response consistency, which states that the optimal policy satisfies a notion of consistency across any possible generation of the model. We derive algorithms that find optimal solutions via sample-based policy gradient and provide theoretical guarantees on their convergence. Our experiments demonstrate the effectiveness of AGRO in both on-policy and off-policy settings, showing improved performance on the MATH dataset over baseline methods.
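The abstract's core idea, a single policy-gradient estimator that covers both on-policy samples and off-policy samples from earlier generations of the model, can be illustrated with a generic importance-weighted policy gradient. This is a minimal sketch over a toy discrete "response" space, not the paper's AGRO algorithm: the softmax policy, the reward setup, and the function names here are all illustrative assumptions.

```python
import numpy as np

def policy_probs(theta):
    """Softmax policy over a small discrete 'response' space (toy stand-in
    for an LLM's distribution over generations)."""
    z = theta - theta.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def mixed_policy_gradient(theta, actions, rewards, behavior_probs):
    """One importance-weighted policy-gradient estimate.

    Off-policy samples (drawn from a behavior policy mu, e.g. an earlier
    generation of the model) are reweighted by pi(a)/mu(a), so the same
    estimator handles on-policy data, where the ratio is 1.
    Illustrative only -- not the AGRO update from the paper.
    """
    p = policy_probs(theta)
    grad = np.zeros_like(theta)
    for a, r, mu in zip(actions, rewards, behavior_probs):
        w = p[a] / mu                # importance weight pi(a) / mu(a)
        glog = -p.copy()             # grad of log pi(a) for softmax: e_a - p
        glog[a] += 1.0
        grad += w * r * glog
    return grad / len(actions)
```

As a usage sketch, sampling from a fixed uniform behavior policy and ascending this gradient shifts probability mass toward the rewarded response, even though no sample is drawn from the current policy:

```python
rng = np.random.default_rng(0)
theta = np.zeros(4)
mu = np.full(4, 0.25)                # frozen behavior policy (off-policy data)
for _ in range(200):
    acts = rng.integers(0, 4, size=32)
    rews = (acts == 0).astype(float)  # toy reward: response 0 is correct
    theta += 0.5 * mixed_policy_gradient(theta, acts, rews, mu[acts])
```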
Submission Number: 199