Keywords: Causal interventions, Other
Other Keywords: Faithfulness, Explainable AI, Causal Abstraction
TL;DR: We propose three desiderata for faithfulness and position causal abstraction as a framework that satisfies them.
Abstract: Faithfulness is a broadly agreed-upon desideratum for explanations of machine learning model predictions. While the community has adopted many different methods, there is no agreed-upon definition of faithfulness. Here, we propose three desiderata for faithfulness that go beyond the standard intuition of accurately representing the model's reasoning process: (1) enabling reverse-engineering of specific behaviors, (2) capturing interventionist causal relations, and (3) achieving an appropriate model decomposition. We argue that causal abstraction satisfies these desiderata and provides a framework for evaluating faithfulness claims in the community.
Submission Number: 185