Star-DS: Step-level Uncertainty-Aware Reasoning Data Selection in Reinforcement Learning for LLM Multi-step Reasoning
Keywords: Reasoning LLMs, Data Selection, Uncertainty Estimation
Abstract: Large language models have demonstrated remarkable potential on complex multi-step reasoning tasks, largely enabled by substantial post-training with reinforcement learning and process reward verification on reasoning datasets. Recent studies have shown that this heavy reliance on data and compute can be alleviated by selecting high-value subsets of the data while maintaining reasoning capability. However, existing data selection methods typically rely only on outcome-level signals derived from final answers to measure data quality, overlooking the step-level signals intrinsic to multi-step reasoning, which leads to suboptimal identification of valuable reasoning data. In this paper, we propose Star-DS, a Step-level Uncertainty-Aware Reasoning Data Selection approach that combines step-level and outcome-level signals to identify high-value reasoning data for reinforcement learning on LLM multi-step reasoning. Specifically, we quantify the value of each sample for RL training using the step-wise self-evaluation uncertainty of its reasoning steps together with the reward variance of its final answer. Experiments with diverse reasoning models across multiple benchmarks demonstrate that our approach consistently identifies high-value data, preserves multi-step reasoning performance after RL training, and significantly reduces both data requirements and computational costs.
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 15165
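The abstract names two selection signals: step-wise self-evaluation uncertainty of a sample's reasoning steps and reward variance over its final answers. Below is a minimal sketch of how such a sample-value score and top-k selection might be computed. The per-sample averaging, the weighted sum, the `alpha` weight, and the function names are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def star_ds_score(step_uncertainties, rollout_rewards, alpha=0.5):
    """Illustrative sample-value score combining the two signals named in the
    abstract. The mean/variance aggregation and the weighted sum are assumed
    here for illustration only.

    step_uncertainties : per-step self-evaluation uncertainty of one sample's
                         reasoning trace (e.g., 1 - self-reported confidence).
    rollout_rewards    : outcome rewards of several rollouts for the same
                         prompt, used to estimate reward variance.
    alpha              : hypothetical weight trading off the two signals.
    """
    step_term = float(np.mean(step_uncertainties))   # step-level signal
    outcome_term = float(np.var(rollout_rewards))    # outcome-level signal
    return alpha * step_term + (1.0 - alpha) * outcome_term

def select_top_k(samples, k, alpha=0.5):
    """Rank candidate training samples by the score above and keep the top-k,
    i.e., the high-value subset used for RL training."""
    scored = sorted(
        samples,
        key=lambda s: star_ds_score(
            s["step_uncertainties"], s["rollout_rewards"], alpha
        ),
        reverse=True,
    )
    return scored[:k]
```

Usage would amount to scoring every candidate prompt with its sampled reasoning traces and rollout rewards, then training the RL policy only on the selected subset, which is how the data and compute reductions described in the abstract would be realized.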