Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

11 Sept 2025 (modified: 29 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: RLVR, Reasoning with LLMs
TL;DR: A method to improve both Pass@1 and Pass@K performance for RLVR in complex reasoning.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful method for enhancing the reasoning abilities of Large Language Models, but its full potential is limited by a lack of exploration along two key dimensions: \textbf{Depth} (the difficulty of problems) and \textbf{Breadth} (the number of training instances). Our analysis of the popular GRPO algorithm reveals a bias that down-weights difficult, low-accuracy problems, which are precisely the problems most crucial for improving reasoning skills. To address this, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights difficult problems through targeted, multi-stage rollouts, allocating more rollouts to harder problems according to our proposed re-balancing schedules and yielding consistent gains in \textit{Pass@K}. In contrast, simply enlarging the rollout size is ineffective and can even harm performance. We also investigate the role of breadth by scaling the batch size and applying full-batch updates, which significantly improves \textit{Pass@1} by maintaining high token-level entropy, indicating continued exploration and reduced gradient noise. Finally, we present DARS-Breadth, which combines DARS with a large breadth of training data and achieves simultaneous gains in both \textit{Pass@K} and \textit{Pass@1}, confirming that depth (adaptive exploration) and breadth (scaled training data) are orthogonal and essential dimensions for unlocking the full reasoning power of RLVR.
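For illustration, below is a minimal sketch of the depth-side idea: allocating rollouts adaptively by estimated difficulty. The function names, the linear schedule, and the specific budget numbers are hypothetical assumptions for exposition; the paper's actual multi-stage rollout procedure and re-balancing schedules are defined in the full text.

```python
def estimate_difficulty(pass_rates):
    """Map empirical pass rates (fraction of verified-correct rollouts) to difficulty in [0, 1]."""
    return [1.0 - p for p in pass_rates]

def allocate_rollouts(pass_rates, base_rollouts=8, max_extra=24):
    """Assign a per-problem rollout budget, concentrating extra rollouts on harder problems.

    One plausible re-balancing schedule: extra rollouts grow linearly with
    estimated difficulty (hypothetical; the paper's schedules may differ).
    """
    budgets = []
    for d in estimate_difficulty(pass_rates):
        extra = int(round(d * max_extra))
        budgets.append(base_rollouts + extra)
    return budgets

# Usage: pass rates observed in a first-stage rollout drive second-stage budgets.
pass_rates = [0.9, 0.5, 0.05]         # easy, medium, hard problem
print(allocate_rollouts(pass_rates))  # [10, 20, 31] -> budget concentrated on the hard problem
```

The design intent mirrored here is that low-accuracy problems, which a uniform rollout budget effectively down-weights, receive more sampling effort so their learning signal is not lost.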
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4061