Three Desiderata for Faithfulness in Machine Learning Explanations: The Case for Causal Abstraction

Mette Friis Andersen; Maria Heuss; Ana Lucic

Published: 30 Sept 2025, Last Modified: 10 Nov 2025

Venue: Mech Interp Workshop (NeurIPS 2025), Poster

License: CC BY 4.0

Keywords: Causal interventions, Other

Other Keywords: Faithfulness, Explainable AI, Causal Abstraction

TL;DR: We propose three desiderata for faithfulness and position Causal Abstraction as a framework that satisfies them.

Abstract: Faithfulness is a broadly agreed-upon desideratum for explanations of machine learning model predictions. While many different methods have been adopted by the community, there is no agreed-upon definition of faithfulness. Here, we propose three desiderata for faithfulness beyond the standard intuition of accurately representing the reasoning process of the model, related to (1) enabling reverse-engineering of specific behaviors, (2) capturing interventionist causal relations, and (3) achieving an appropriate model decomposition. We argue that causal abstraction satisfies these desiderata and provides a framework for evaluating faithfulness claims in the community.

Submission Number: 185
