Did You Faithfully Say What You Thought? Bridging the Gap Between LLMs Neural Activity and Self-Explanations

ICLR 2026 Conference Submission 21825 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: interpretability, explainable AI, self-explanation, faithfulness
TL;DR: This paper proposes NeuroFaith, a framework measuring the faithfulness of LLM free text self-explanations.
Abstract: Large Language Models (LLMs) can generate plausible free text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model's actual reasoning process, indicating a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free text self-explanations by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model's predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, a linear faithfulness probe based on NeuroFaith is developed to detect unfaithful self-explanations from the representation space and to improve faithfulness through steering. NeuroFaith provides a principled approach to evaluating and enhancing the faithfulness of LLM free text self-explanations, addressing critical needs for trustworthy AI systems.
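To make the probe-and-steer idea in the abstract concrete, the following is a minimal, hypothetical sketch: it assumes access to hidden representations of self-explanations and binary faithfulness labels (e.g., produced by NeuroFaith's mechanistic test), neither of which is specified here. The data, layer choice, and steering coefficient are placeholders, not the paper's actual setup.

```python
# Hypothetical sketch: linear faithfulness probe + steering direction.
# Synthetic stand-ins are used for hidden states and labels; in the paper's
# setting these would come from an LLM's internal representations and
# NeuroFaith's faithfulness judgments.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

d_model = 768
n_examples = 512
h = rng.normal(size=(n_examples, d_model))   # placeholder hidden states
y = rng.integers(0, 2, size=n_examples)      # 1 = faithful, 0 = unfaithful (placeholder)

# Linear probe: logistic regression over the representation space.
probe = LogisticRegression(max_iter=1000).fit(h, y)
print("probe training accuracy:", probe.score(h, y))

# Steering sketch: nudge a representation along the probe's normalized
# "faithful" direction; alpha would be tuned on held-out data in practice.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
alpha = 2.0
h_steered = h[0] + alpha * w
```

The design choice here is deliberately simple: a single linear direction both classifies unfaithful explanations and supplies a steering vector, which is the usual pattern for linear-probe-based interventions; the paper's actual probe and steering procedure may differ.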
Primary Area: interpretability and explainable AI
Submission Number: 21825