CASPO: Confidence-aware Step-wise Preference Optimization for Reliable Reasoning in Large Language Models

05 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Reasoning Models, Preference Optimization, Step-wise alignment
TL;DR: CASPO leverages step-wise confidence to guide training and inference, improving the reliability of multi-step reasoning in LLMs.
Abstract: Large language models (LLMs) have demonstrated strong performance on tasks requiring multi-step reasoning, from mathematical derivations to knowledge-intensive open-domain generation. However, even when LLMs produce correct final answers, their reasoning processes often involve uncertain or inconsistent steps, which makes them prone to failure when they face similar problems again. To address this issue, we introduce CASPO, a framework that incorporates step-wise confidence into both training and inference. During training, CASPO constructs confidence-filtered preference pairs that capture both correct but low-confidence predictions and incorrect predictions, and optimizes them through iterative Direct Preference Optimization. During inference, we propose a Confidence-aware Thought (CaT) strategy that prunes low-confidence reasoning trajectories to enhance reliability. Experiments on 10 reasoning benchmarks across diverse model families show that CASPO yields improvements in both step-wise faithfulness and final-answer accuracy. We also release a step-wise dataset with confidence annotations to facilitate fine-grained analysis of model reasoning and expose hidden inconsistencies in existing benchmarks.
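The abstract outlines two confidence-driven mechanisms without specifying their details: confidence-filtered preference-pair construction for training, and CaT pruning at inference. A minimal sketch of how such mechanisms could look is below; the pairing rule, thresholds, and field names (`correct`, `conf`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of CASPO-style confidence filtering. The paper does not
# specify the pairing rule or thresholds; everything here is an assumption.

def build_preference_pairs(steps, conf_threshold=0.7):
    """Pair confident-correct steps (chosen) against correct-but-low-confidence
    and incorrect steps (rejected), matching the abstract's description."""
    chosen = [s for s in steps if s["correct"] and s["conf"] >= conf_threshold]
    rejected = [
        s for s in steps
        if (s["correct"] and s["conf"] < conf_threshold) or not s["correct"]
    ]
    # Each (chosen, rejected) pair could then feed an iterative DPO objective.
    return [(c, r) for c in chosen for r in rejected]

def prune_trajectories(trajectories, conf_threshold=0.5):
    """Sketch of a Confidence-aware Thought (CaT) filter: discard any
    reasoning trajectory containing a step below the confidence threshold."""
    return [t for t in trajectories
            if min(s["conf"] for s in t) >= conf_threshold]
```

For example, with one confident-correct step, one low-confidence-correct step, and one incorrect step, `build_preference_pairs` yields two pairs, both anchored on the single confident-correct step.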
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 2410