Keywords: LLM Reasoning, Mathematical Reasoning, Probabilistic Confidence, Large Language Models, Large Reasoning Models
Abstract: Best-of-$n$ sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection and Ranking for Reasoning Chains (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and the final answer. This score decomposes into the confidence of the reasoning path (*reasoning confidence*) and the confidence of the final answer (*answer confidence*). PiCSAR achieves substantial gains across several benchmarks ($+11.7$ on AIME2024, $+9.81$ on AIME2025), outperforming baselines with at least 2x fewer samples in 20 out of 25 comparisons. Our analysis reveals that correct reasoning chains exhibit higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
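The selection rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the segment-wise length normalisation and the toy log-probabilities are assumptions made for the example.

```python
def picsar_score(reasoning_logprobs, answer_logprobs):
    # Joint log-likelihood of the reasoning chain and the final answer,
    # split into reasoning confidence and answer confidence.
    # Length-normalising each segment (an assumption here) keeps long
    # chains from being unfairly penalised.
    reasoning_conf = sum(reasoning_logprobs) / max(len(reasoning_logprobs), 1)
    answer_conf = sum(answer_logprobs) / max(len(answer_logprobs), 1)
    return reasoning_conf + answer_conf

def best_of_n(candidates):
    # candidates: list of (reasoning_logprobs, answer_logprobs, answer_text)
    # tuples; return the answer of the highest-scoring candidate.
    return max(candidates, key=lambda c: picsar_score(c[0], c[1]))[2]

# Toy example with hypothetical per-token log-probabilities.
candidates = [
    ([-0.9, -1.2, -0.8], [-0.5, -0.6], "42"),
    ([-0.2, -0.3, -0.1], [-0.1, -0.2], "17"),
]
print(best_of_n(candidates))  # → "17", the higher-confidence candidate
```

In a real pipeline, the per-token log-probabilities would come from the sampling model itself, so no extra reward model or training is needed.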
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: reasoning, math QA
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6625