RL-finetuning LLMs from on- and off-policy data with a single algorithm
TL;DR: LLM RL finetuning with both on-policy and off-policy data
Abstract: We introduce a novel reinforcement learning algorithm (AGRO, for Any-Generation Reward Optimization) for finetuning Large Language Models. AGRO builds on the concept of response consistency: the optimal policy satisfies a consistency condition across every possible generation of the model. We derive algorithms that find optimal solutions via sample-based policy gradient and provide theoretical guarantees on their convergence. Our experiments demonstrate the effectiveness of AGRO in both on-policy and off-policy settings, showing improved performance on the MATH dataset over baseline methods.
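To make the on-/off-policy setting concrete, here is a minimal, hypothetical sketch of a sample-based policy-gradient estimator that accepts data from an arbitrary behavior policy via importance weighting. This is an illustration of the general idea only, not the paper's actual AGRO update; the function names and the toy softmax policy are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def pg_estimate(logits, actions, behavior_probs, rewards):
    """Importance-weighted REINFORCE gradient estimate (illustrative only).

    `actions` are samples drawn from a behavior policy mu (off-policy data);
    `behavior_probs[i]` is mu(actions[i]). When mu equals the current policy
    pi, every weight is 1 and this reduces to the on-policy estimator, so a
    single estimator covers both regimes.
    """
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, mu_a, r in zip(actions, behavior_probs, rewards):
        w = pi[a] / mu_a  # importance weight pi(a) / mu(a)
        # grad of log pi(a) for a softmax policy is (one_hot(a) - pi)
        for i in range(len(logits)):
            grad[i] += w * r * ((1.0 if i == a else 0.0) - pi[i])
    n = len(actions)
    return [g / n for g in grad]
```

In an LLM setting the "action" would be a full generated response and `logits` the model's sequence log-probabilities, but the same on-/off-policy unification via the ratio pi/mu carries over.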
Submission Number: 199