Keywords: Data Selection, Large Reasoning Models, Gradient Alignment
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we present a novel gradient-alignment-based method, named \textit{LearnAlign}, which intelligently selects learnable and representative reasoning data for RLVR post-training. To overcome response-length bias in gradient norms, we introduce a data learnability measure based on success rate, which indicates the learning potential of each data point. Experiments across five reasoning benchmarks show that our method significantly reduces training data requirements while incurring only minor performance degradation, or even improving performance, compared to full-data training. Specifically, on the GSM8K benchmark it reduces data requirements by up to 1,000 data points while achieving better performance (77.5$\%$) than training on the full dataset (77.0$\%$). Furthermore, its efficiency is demonstrated on both mathematical and code benchmarks using far less data from the DAPO-MATH-17K dataset. We believe this work provides insights into data-efficient RL post-training and can inform future research on reasoning data selection. To facilitate future work, we will release our code.
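The abstract does not spell out the selection procedure, but the core idea (score data by a success-rate-based learnability measure and by how well per-example gradients align with a representative direction) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the learnability score p(1-p), the greedy cosine-alignment selection, and all function names are assumptions for illustration only.

```python
import numpy as np


def learnability(success_rate: np.ndarray) -> np.ndarray:
    # Assumed form: highest for prompts the model sometimes solves (p near 0.5),
    # lowest for prompts it always or never solves. The paper's exact measure may differ.
    return success_rate * (1.0 - success_rate)


def select_data(per_example_grads: np.ndarray,
                success_rate: np.ndarray,
                budget: int) -> list[int]:
    """Greedily pick `budget` examples whose (normalized) gradients align with the
    mean gradient of the remaining pool, weighted by learnability."""
    grads = per_example_grads / (
        np.linalg.norm(per_example_grads, axis=1, keepdims=True) + 1e-8)
    weights = learnability(success_rate)
    selected: list[int] = []
    remaining = set(range(len(grads)))
    for _ in range(budget):
        # Representative direction of the data not yet selected.
        ref = grads[list(remaining)].mean(axis=0)
        ref /= np.linalg.norm(ref) + 1e-8
        # Alignment (cosine similarity) scaled by learnability.
        scores = {i: weights[i] * float(grads[i] @ ref) for i in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy usage: 100 examples with random gradient features and success rates.
rng = np.random.default_rng(0)
chosen = select_data(rng.normal(size=(100, 16)), rng.uniform(size=100), budget=10)
print(chosen)
```

In an actual RLVR setting the gradient features would come from the policy model and the success rate from sampled rollouts per prompt; the sketch only conveys how learnability and gradient alignment could combine into a selection criterion.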
Primary Area: foundation or frontier models, including LLMs
Submission Number: 419