Keywords: Large Language Models, Large Reasoning Models, Probabilistic Confidence, Mathematical Reasoning
Abstract: Best-of-$n$ sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking for Reasoning Chains (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. This joint score combines the likelihood of the reasoning path (_reasoning confidence_) with that of the final answer (_answer confidence_). PiCSAR achieves substantial gains across diverse benchmarks ($+11.7$ on AIME2024, $+9.81$ on AIME2025), outperforming baselines with at least 2x fewer samples in 20 out of 25 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, which accounts for the effectiveness of PiCSAR.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19221
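For illustration, here is a minimal sketch of the best-of-$n$ selection the abstract describes, assuming the sampling model exposes per-token log-probabilities for the reasoning chain and the final answer, and that the two confidences are aggregated as unnormalised sums (the paper may normalise or weight these terms differently). `Candidate`, `picsar_score`, and `best_of_n` are hypothetical names introduced here for clarity.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled generation, split into reasoning chain and final answer."""
    reasoning_logprobs: list[float]  # per-token log-probabilities of the reasoning chain
    answer_logprobs: list[float]     # per-token log-probabilities of the final answer
    answer: str


def picsar_score(c: Candidate) -> float:
    # The joint log-likelihood of (reasoning, answer) decomposes into the two
    # confidence terms named in the abstract (assumption: plain sums, no length
    # normalisation).
    reasoning_confidence = sum(c.reasoning_logprobs)  # reasoning confidence
    answer_confidence = sum(c.answer_logprobs)        # answer confidence
    return reasoning_confidence + answer_confidence


def best_of_n(candidates: list[Candidate]) -> str:
    # Best-of-n selection: return the answer of the highest-scoring chain,
    # with no access to ground-truth answers.
    return max(candidates, key=picsar_score).answer
```

Because the score needs only log-probabilities that the generating model already produces, this selection rule is training-free and adds no reward-model inference cost.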