The Mirage of Explainability: A Survey on Chain-of-Thought Faithfulness in Large Language Models

ACL ARR 2026 January Submission5341 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Chain-of-Thought, Large Language Models, Faithfulness, Explainable NLP
Abstract: Chain-of-Thought (CoT) reasoning appears to provide explainability, leading users to trust that verbalized rationales reflect the model's underlying computation. However, substantial evidence indicates that CoT often fails to reflect the model's actual decision-making process, which has prompted a surge of research into the faithfulness of these explanations. This paper presents a comprehensive survey of CoT faithfulness. We first unify the definition of faithfulness by integrating internal alignment with external consistency, and we synthesize key failure phenomena such as post-hoc rationalization and sycophancy. We then systematize evaluation metrics and benchmarks, and critically review current mitigation strategies. We conclude by outlining open challenges and advocating for architectural innovations to achieve genuinely faithful reasoning.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Explainability of NLP Models
Contribution Types: Surveys
Languages Studied: English
Submission Number: 5341