The Mirage of Explainability: A Survey on Chain-of-Thought Faithfulness in Large Language Models

ACL ARR 2026 January Submission5341 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Chain-of-Thought, Large Language Models, Faithfulness, Explainable NLP
Abstract: Chain-of-Thought (CoT) reasoning appears to provide explainability, leading users to trust that verbalized rationales reflect the model's underlying computation. However, substantial evidence indicates that CoT often fails to reflect the model's actual decision-making process, which has prompted a surge of research into the faithfulness of these explanations. This paper presents a comprehensive survey of CoT faithfulness. We first unify the definition of faithfulness by integrating internal alignment with external consistency, and we synthesize key failure phenomena such as post-hoc rationalization and sycophancy. We then systematize evaluation metrics and benchmarks, and critically review current mitigation strategies. We conclude by outlining open challenges and advocating for architectural innovations to achieve genuinely faithful reasoning.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Explainability of NLP Models
Contribution Types: Surveys
Languages Studied: English
Submission Number: 5341