Binary Search for RLVR

ICLR 2026 Conference Submission 16240 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Binary Search, RLVR
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) is a powerful paradigm, yet it suffers from a critical inefficiency: the profound underutilization of rare successes. The challenge is two-fold: exploring to *find* a successful trajectory, and then *learning* effectively from it. While many methods focus on the former, we address the latter. For challenging tasks where rewards are awarded only to complete, successful trajectories, such successes are rare. A single such trajectory is therefore a goldmine of information, but conventional methods treat it as just one data point, wasting a crucial learning opportunity. We introduce Binary Attribution of Sparse Signals (BASS), a method that reframes the problem from finding successes to maximizing the learning extracted from them. BASS treats a verified successful trajectory not as an answer, but as a *blueprint* to be deconstructed. It performs a binary search over the trajectory's prefixes to locate the model's *edge of competence*, i.e., the boundary where correct reasoning can falter. This process unlocks the full value of a single success by generating a rich, contrastive group of *near-miss* negatives (failures from good prefixes) and *far-reach* positives (diverse successes from shorter prefixes), providing the nuanced feedback required for robust policy optimization. Unlike methods focused on proactive exploration, BASS is a reactive, credit-focused mechanism that ensures every hard-won success is maximally leveraged to sharpen the policy. On average across three challenging math benchmarks with Qwen3-8B, BASS improves the avg@32 score by $+2.7$ percentage points ($\mathrm{pp}$) over the GRPO baseline, demonstrating that meticulous learning from rare successes leads to more robust and generalizable reasoning.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16240
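
The prefix binary search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `find_edge_of_competence` and `split_contrastive_group`, the `rollout_succeeds` callback, the `samples_per_probe` and `threshold` parameters, and the assumption that the success rate is non-decreasing in prefix length are all hypothetical choices made for illustration. The sketch also simplifies the grouping by treating every failed rollout as a negative and every successful rollout as a positive.

```python
from typing import Callable, List, Sequence, Tuple

# (prefix length the rollout started from, whether the verifier accepted it)
Rollout = Tuple[int, bool]


def find_edge_of_competence(
    trajectory: Sequence[str],
    rollout_succeeds: Callable[[Sequence[str]], bool],
    samples_per_probe: int = 4,
    threshold: float = 0.5,
) -> Tuple[int, List[Rollout]]:
    """Binary-search the shortest prefix length whose empirical success
    rate reaches `threshold`.

    Assumption (hypothetical): the success rate does not decrease as the
    prefix of the verified trajectory gets longer.
    """
    lo, hi = 0, len(trajectory)  # the full trajectory is known to succeed
    rollouts: List[Rollout] = []
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = trajectory[:mid]
        # Sample several continuations from this prefix and verify each one.
        outcomes = [rollout_succeeds(prefix) for _ in range(samples_per_probe)]
        rollouts.extend((mid, ok) for ok in outcomes)
        if sum(outcomes) / samples_per_probe >= threshold:
            hi = mid       # prefix is already sufficient; try a shorter one
        else:
            lo = mid + 1   # prefix is too short; move toward the full trajectory
    return lo, rollouts


def split_contrastive_group(rollouts: List[Rollout]) -> Tuple[List[int], List[int]]:
    """Split probe rollouts into prefix lengths of failures and of successes."""
    negatives = [n for n, ok in rollouts if not ok]
    positives = [n for n, ok in rollouts if ok]
    return negatives, positives


if __name__ == "__main__":
    # Toy verifier: a continuation succeeds once at least 5 steps are given.
    steps = [f"step_{i}" for i in range(8)]
    edge, probes = find_edge_of_competence(steps, lambda p: len(p) >= 5)
    negatives, positives = split_contrastive_group(probes)
    print(edge, len(negatives), len(positives))
```

With a real policy, `rollout_succeeds` would sample a continuation from the model and run the task's verifier, so the outcomes would be stochastic and the located boundary only an estimate.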