Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

ACL ARR 2026 January Submission 8133 Authors

06 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: Normalized Logit Difference Decay, Chain of Thought, Reasoning Horizon, Large Language Models, Representational Similarity Analysis, Trajectory Alignment Score
Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual steps in the explanation and measures how much the model's confidence in its answer drops, indicating whether a step is causally important. Because these measurements are normalized, NLDD enables rigorous comparison across models with different architectures. Testing three model families on syntactic, logical, and arithmetic tasks, we find a consistent Reasoning Horizon (k*) at 70–85\% of chain length, beyond which further reasoning tokens have little or even negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain; NLDD offers a way to measure when CoT matters.
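To make the abstract's metric concrete, below is a minimal sketch of an NLDD-style probe, assuming a Hugging Face-style causal language model. The corruption function, the choice of a contrastive foil token, and the exact normalization are illustrative assumptions for this sketch, not the authors' released implementation.

```python
import torch

def nldd_per_step(model, tokenizer, question, cot_steps,
                  answer_token_id, foil_token_id, corrupt_fn):
    """Sketch of a per-step NLDD-style probe (hypothetical interface).

    For each reasoning step, corrupt that step, re-run the model, and
    record the drop in the logit difference between the answer token and
    a contrastive foil, normalized by the clean logit difference.
    """
    def logit_diff(steps):
        prompt = question + " " + " ".join(steps)
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]  # next-token logits
        return (logits[answer_token_id] - logits[foil_token_id]).item()

    clean = logit_diff(cot_steps)
    decays = []
    for i in range(len(cot_steps)):
        corrupted = cot_steps[:i] + [corrupt_fn(cot_steps[i])] + cot_steps[i + 1:]
        # Normalizing by the clean logit difference (guarded against
        # near-zero values) is an assumed choice; it makes the per-step
        # drop comparable across models with different logit scales.
        decays.append((clean - logit_diff(corrupted)) / max(abs(clean), 1e-6))
    return decays
```

On this reading, the Reasoning Horizon k* would be the step index beyond which the per-step decay stays near (or below) zero, i.e., corrupting later steps no longer reduces the model's confidence in its answer.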
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: counterfactual/contrastive explanations, explanation faithfulness, probing
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 8133