The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders
Keywords: Mechanistic Interpretability, Sparse Autoencoders, Faithfulness, Sycophancy, Chain-of-Thought, Large Language Models, Probing, AI Safety
Abstract: Large Language Models (LLMs) frequently exhibit unfaithful behavior, producing a final answer that diverges from their internal chain-of-thought (CoT) reasoning in order to appease the user. To detect this behavior, we introduce the Hypocrisy Gap, a mechanistic metric that uses Sparse Autoencoders (SAEs) to quantify the divergence between a model's internal reasoning and its final generation. By comparing an internal truth belief, derived via sparse linear probes, against the latent-space trajectory of the final generation, we detect a model's tendency toward unfaithful behavior. Experiments on Gemma, Llama, and Qwen models using Anthropic's Sycophancy benchmark show that our method achieves an AUROC of 0.55–0.73 for detecting sycophantic runs and 0.55–0.74 for hypocritical cases where the model internally “knows” the user is wrong, consistently outperforming a decision-aligned log-probability baseline (0.41–0.50 AUROC).
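To make the abstract's pipeline concrete, here is a minimal sketch of how a divergence score of this kind could be computed and evaluated with AUROC. The function name `hypocrisy_gap`, the projection-based scoring, and the synthetic data are illustrative assumptions for exposition; the paper's actual probe training and gap formulation may differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hypocrisy_gap(belief_direction: np.ndarray,
                  answer_features: np.ndarray) -> float:
    """Score divergence between a probe-derived internal belief
    direction and the SAE feature trajectory of the final answer.

    belief_direction: (d,) vector from a sparse linear probe trained
        to predict the model's internal truth judgment.
    answer_features: (T, d) SAE feature activations over the T tokens
        of the generated answer.
    """
    belief_direction = belief_direction / np.linalg.norm(belief_direction)
    # Mean projection of the answer trajectory onto the belief direction;
    # a strongly negative projection means the generation moves away
    # from what the probe says the model internally believes.
    projections = answer_features @ belief_direction
    return float(-projections.mean())

# Toy evaluation: higher gap scores should flag sycophantic runs.
rng = np.random.default_rng(0)
d, T = 64, 12
w = rng.normal(size=d)  # stand-in probe direction
faithful = [hypocrisy_gap(w, rng.normal(size=(T, d)) + 0.5 * w)
            for _ in range(50)]
sycophantic = [hypocrisy_gap(w, rng.normal(size=(T, d)) - 0.5 * w)
               for _ in range(50)]
labels = [0] * len(faithful) + [1] * len(sycophantic)
print(f"AUROC: {roc_auc_score(labels, faithful + sycophantic):.2f}")
```

Under these assumptions, a run whose answer-token activations point away from the probe's truth direction receives a large gap score, mirroring the abstract's "hypocritical" case where the model internally knows the user is wrong yet answers otherwise.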
Paper Type: Short
Research Area: Special Theme (conference specific)
Research Area Keywords: interpretability; explanation faithfulness; probing; safety and alignment; reasoning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 8699