Faithfulness through Causal Abstraction: Aligning explanations of how models reason

Published: 30 Sept 2025, Last Modified: 30 Sept 2025, Mech Interp Workshop (NeurIPS 2025) Poster, CC BY 4.0
Keywords: Causal interventions, Other
Other Keywords: Faithfulness, Explainable AI, Causal Abstraction
TL;DR: We propose desiderata for faithfulness and position Causal Abstraction as a framework for aligning faithfulness claims.
Abstract: Faithfulness is a broadly agreed-upon desideratum for explanations of machine learning (ML) model predictions. While many different methods have been adopted by the community, there is no agreed-upon definition of faithfulness [1]. Here, we propose desiderata for faithfulness beyond the standard intuition of “accurately representing the reasoning process of the model” [2; 3]. We highlight a recently introduced mechanistic interpretability (MI) framework, referred to as Causal Abstraction (CA), and argue that CA provides a framework capable of aligning faithfulness claims in the community.
Submission Number: 185