Three Desiderata for Faithfulness in Machine Learning Explanations: The Case for Causal Abstraction

Published: 30 Sept 2025, Last Modified: 10 Nov 2025, Venue: Mech Interp Workshop (NeurIPS 2025) Poster, License: CC BY 4.0
Keywords: Causal interventions, Other
Other Keywords: Faithfulness, Explainable AI, Causal Abstraction
TL;DR: We propose three desiderata for faithfulness and position causal abstraction as a framework that satisfies them.
Abstract: Faithfulness is a broadly agreed-upon desideratum for explanations of machine learning model predictions. While many different methods have been adopted by the community, there is no agreed-upon definition of faithfulness. Here, we propose three desiderata for faithfulness beyond the standard intuition of accurately representing the reasoning process of the model, related to (1) enabling reverse-engineering of specific behaviors, (2) capturing interventionist causal relations, and (3) achieving an appropriate model decomposition. We argue that causal abstraction satisfies these and provides a framework for evaluating faithfulness claims in the community.
Submission Number: 185