Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think

ACL ARR 2025 May Submission 3910 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer. In this paper, we challenge the reliance on the final answer by posing two questions: Does the final answer reliably represent the model's optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach segments reasoning traces into subthoughts using linguistic cues. We then prompt the model for continuations from each subthought's endpoint, extracting a potential answer from each. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy than relying solely on the last answer. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model's correctness, suggesting potential for identifying incorrect responses. Our experiments across various LLMs and two mathematical reasoning datasets, AIME2024 and AIME2025, show consistent accuracy improvements, with gains of up to 13% and 10%, respectively.
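The pipeline described in the abstract can be sketched in a few lines. The Python below is an illustrative sketch under stated assumptions, not the authors' implementation: `generate` is a hypothetical stand-in for any LLM completion call, the `CUES` list is an assumed example (the paper's exact linguistic cues are not given here), and answers are assumed to appear as \boxed{...}, a common output convention on math benchmarks such as AIME.

```python
import re
from collections import Counter

# Hypothetical stand-in for an LLM call; plug in any completion client.
def generate(prompt: str) -> str:
    raise NotImplementedError("replace with a real LLM client call")

# Illustrative linguistic cues that often open a new reasoning step;
# the paper's actual cue inventory may differ.
CUES = ("Wait", "Alternatively", "So", "Therefore", "Let me", "Hmm")

def split_into_subthoughts(trace: str) -> list[str]:
    """Split a reasoning trace at positions where a cue word begins."""
    pattern = r"(?=\b(?:" + "|".join(CUES) + r")\b)"
    return [p for p in re.split(pattern, trace) if p.strip()]

def extract_answer(text: str) -> str | None:
    """Pull the last \\boxed{...} answer from a generated text."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

def mode_answer(question: str, trace: str) -> str | None:
    """Re-prompt from each subthought endpoint and return the modal answer."""
    subthoughts = split_into_subthoughts(trace)
    answers = []
    for i in range(1, len(subthoughts) + 1):
        prefix = "".join(subthoughts[:i])          # trace up to this subthought
        completion = generate(question + "\n" + prefix)
        ans = extract_answer(prefix + completion)  # answer of this continuation
        if ans is not None:
            answers.append(ans)
    # Aggregate by frequency: the mode across continuations, per the abstract.
    return Counter(answers).most_common(1)[0][0] if answers else None
```

In use, one would compare `mode_answer(question, trace)` against the answer extracted from the original trace alone; the abstract reports that the modal answer is frequently the more accurate of the two.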
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: reasoning, answer consistency
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3910