Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

11 Sept 2025 (modified: 29 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: RLVR, Reasoning with LLMs
TL;DR: A method to improve both Pass@1 and Pass@K performance for RLVR in complex reasoning.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful method for enhancing the reasoning abilities of Large Language Models, but its full potential is limited by a lack of exploration along two key dimensions: \textbf{Depth} (the difficulty of problems) and \textbf{Breadth} (the number of training instances). Our analysis of the popular GRPO algorithm reveals a bias that down-weights difficult, low-accuracy problems, which are precisely the problems most crucial for improving reasoning skills. To address this, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights difficult problems through targeted, multi-stage rollouts, allocating more rollouts to harder problems according to our proposed re-balancing schedules and yielding consistent gains in \textit{Pass@K}. In contrast, simply enlarging the rollout size is ineffective and can even harm performance. We also investigate the role of breadth by scaling the batch size and applying full-batch updates, which significantly improves \textit{Pass@1} by maintaining high token-level entropy, indicating continued exploration and reduced gradient noise. Finally, we present DARS-Breadth, which combines DARS with a large breadth of training data and achieves simultaneous gains in both \textit{Pass@K} and \textit{Pass@1}, confirming that depth (adaptive exploration) and breadth (scaled training data) are orthogonal and essential dimensions for unlocking the full reasoning power of RLVR.
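For illustration, below is a minimal sketch of the depth-side idea: allocating rollouts adaptively by estimated difficulty. The function names, the linear schedule, and the specific budget numbers are hypothetical assumptions for exposition; the paper's actual multi-stage rollout procedure and re-balancing schedules are defined in the full text.

```python
def estimate_difficulty(pass_rates):
    """Map empirical pass rates (fraction of verified-correct rollouts) to difficulty in [0, 1]."""
    return [1.0 - p for p in pass_rates]

def allocate_rollouts(pass_rates, base_rollouts=8, max_extra=24):
    """Assign a per-problem rollout budget, concentrating extra rollouts on harder problems.

    One plausible re-balancing schedule: extra rollouts grow linearly with
    estimated difficulty (hypothetical; the paper's schedules may differ).
    """
    budgets = []
    for d in estimate_difficulty(pass_rates):
        extra = int(round(d * max_extra))
        budgets.append(base_rollouts + extra)
    return budgets

# Usage: pass rates observed in a first-stage rollout drive second-stage budgets.
pass_rates = [0.9, 0.5, 0.05]         # easy, medium, hard problem
print(allocate_rollouts(pass_rates))  # [10, 20, 31] -> budget concentrated on the hard problem
```

The design intent mirrored here is that low-accuracy problems, which a uniform rollout budget effectively down-weights, receive more sampling effort so their learning signal is not lost.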
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4061