Keywords: Large language models, reinforcement learning, reasoning, on-policy exploration
TL;DR: A reinforcement learning framework that leverages human- or oracle-written solution prefixes to guide on-policy exploration, enabling LLMs to learn from hard problems that standard RL fails to solve.
Abstract: Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods cannot make use of every problem in a training dataset. On-policy RL rarely produces even a single correct rollout on hard problems, yielding no reward signal and thus no learning. Moreover, mixing easy problems into the training set can be detrimental, as on-policy RL may derive a larger distribution-sharpening signal from these problems, impairing its ability to solve harder problems reliably. While one might attempt to address this by distilling human- or model-written solutions into models, these traces are not only expensive and hard to write but also serve as poor fine-tuning targets: although they produce correct outputs, their concise reasoning paths are extremely challenging to learn from. We introduce Privileged On-Policy Exploration (POPE), a framework that leverages already available solutions from humans or other models to obtain a learning signal on hard problems by using them as "privileged" information that guides exploration. Concretely, POPE augments hard prompts with a minimal solution prefix as guidance, enabling RL to obtain non-zero rewards when rolling out conditioned on this prefix. We show that this approach allows RL to acquire behaviors that transfer back to the original problems. This process expands the set of solvable problems and improves performance on challenging reasoning benchmarks.
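The core mechanism described in the abstract, augmenting a hard prompt with a minimal solution prefix so that some rollouts earn a non-zero reward, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the `rollout` function is a stand-in for sampling from an LLM policy, and the success probability, the `Hint:` prompt format, and the function names are all assumptions made for the sketch.

```python
import random

def rollout(prompt: str) -> str:
    # Toy stand-in for sampling a solution from an LLM policy.
    # This mock "solves" the problem with probability proportional to
    # how many solution steps the prompt already contains (assumption).
    guidance = prompt.count("step")
    return "correct" if random.random() < 0.2 * guidance else "wrong"

def pope_rollouts(hard_prompt: str, solution_prefix_steps: list[str], n: int = 8) -> list[float]:
    """Illustrative sketch of privileged on-policy exploration:
    augment a hard prompt with a minimal solution prefix so that
    some on-policy rollouts earn a non-zero reward, producing a
    learning signal where unguided RL would get none."""
    prefix = " ".join(solution_prefix_steps)
    augmented = f"{hard_prompt}\nHint: {prefix}" if prefix else hard_prompt
    return [1.0 if rollout(augmented) == "correct" else 0.0 for _ in range(n)]

random.seed(0)
plain = pope_rollouts("Prove the statement.", [])                     # no prefix: zero reward everywhere
guided = pope_rollouts("Prove the statement.", ["step 1", "step 2"])  # privileged prefix: some reward
print(sum(plain), sum(guided))
```

Under this toy setup, the unguided batch yields all-zero rewards (no gradient signal), while the prefix-conditioned batch yields some positive rewards, which is the situation POPE exploits.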
Submission Number: 255