Chain-of-Thought Degrades Abstention in Large Language Models, Unless Inverted

ICLR 2026 Conference Submission13714 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: abstention, model safety, chain of thought
Abstract: For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: *abstain*. Chain-of-Thought (CoT) prompting has gained popularity for improving model performance by ensuring structured outputs that follow a logical sequence. In this paper, we first investigate how current abstention methods perform with CoT outputs, finding that direct use of reasoning traces can degrade the performance of existing abstention methods by more than 5%. As a result, we introduce a new framework for thinking about hallucinations in LLMs not as answering a question incorrectly but instead as LLMs answering the *wrong* question. Based on this framework, we develop a new class of state-of-the-art abstention methods called **Trace Inversion**. First, we generate the reasoning trace of a model. Based only on the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. A low similarity score between the initial and reconstructed queries suggests that the model likely answered a question other than the one asked, so the query is flagged for abstention. We perform extensive experiments and find that our Trace Inversion methods achieve impressive performance gains. The code is publicly available at: https://anonymous.4open.science/r/trace-inversion-9EE0/.
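
The abstract describes the Trace Inversion loop procedurally (generate a trace, reconstruct the query from the trace alone, compare, abstain on low similarity). The sketch below is only an illustration of that loop under stated assumptions: the `generate` callable, the prompt wording, the difflib-based lexical similarity, and the 0.5 threshold are all placeholders and are not taken from the paper or its repository.

```python
"""Minimal sketch of a Trace Inversion-style abstention check.

Assumptions (not from the paper): `generate` is any callable mapping a prompt
string to a model completion; similarity is approximated with a lexical ratio
from difflib, whereas the paper's methods may use a different scorer; the
prompts and the 0.5 threshold are illustrative only.
"""
from difflib import SequenceMatcher
from typing import Callable


def trace_inversion_abstain(
    query: str,
    generate: Callable[[str], str],   # hypothetical LLM interface
    threshold: float = 0.5,           # illustrative abstention threshold
) -> tuple[bool, str]:
    """Return (should_abstain, reasoning_trace) for a single query."""
    # Step 1: elicit a chain-of-thought reasoning trace for the query.
    trace = generate(
        f"Answer the question. Think step by step.\n\nQuestion: {query}"
    )

    # Step 2: invert the trace -- reconstruct the most likely query that
    # produced this reasoning, without showing the original query.
    reconstructed = generate(
        "Here is a model's step-by-step reasoning. "
        f"What question was it most likely answering?\n\nReasoning: {trace}"
    )

    # Step 3: compare the original and reconstructed queries; a low
    # similarity suggests the model answered the wrong question -> abstain.
    similarity = SequenceMatcher(None, query.lower(), reconstructed.lower()).ratio()
    return similarity < threshold, trace
```

In practice, `generate` would wrap an API or local model call, and the comparison step could use embedding similarity rather than lexical overlap; the released code at the link above gives the authors' actual implementation.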
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13714