Explanation Faithfulness is Alignment: A Unifying and Geometric Perspective on Interpretability Evaluation
Abstract: Interpretability researchers face a universal question: without access to ground truth explanation labels, how can the faithfulness of an explanation to its model be determined? Despite immense efforts to develop new evaluation methods, current approaches remain in a pre-paradigmatic state: fragmented, difficult to calibrate, and lacking cohesive theoretical grounding. Observing the lack of a unifying theory, we propose a Generalised Explanation Faithfulness (GEF) evaluative criterion centred on alignment that combines existing perturbation-based evaluations, eliminating the need for singular, task-specific evaluations. Complementing this unifying perspective, from a geometric point of view, we reveal a prevalent yet critical oversight in current evaluation practice: the failure to account for the learned geometry and non-linear mapping present in the model and explanation spaces. To solve this, we propose a general-purpose, threshold-free faithfulness evaluator that incorporates principles from differential geometry, facilitating evaluation agnostically across tasks and explanation approaches. Through extensive cross-domain benchmarks on natural language processing, vision, and tabular tasks, we provide first-of-its-kind insights into the comparative performance of local linear approximations and global feature visualisation methods, and the faithfulness of large language models (LLMs) as post-hoc explainers. Our contributions are of substantial importance to the interpretability community, offering a principled, unified approach to evaluate the faithfulness of explanations. Code is available at url.
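To make the perturbation-based evaluation paradigm the abstract refers to concrete, the sketch below shows a generic faithfulness check of this family: it correlates the attribution an explanation assigns to features with the change in model output when those features are perturbed. This is a minimal illustration, not the paper's GEF criterion or its geometry-aware evaluator; the function name, the zero-masking baseline, and the Pearson-correlation score are all illustrative assumptions.

```python
import numpy as np

def perturbation_faithfulness(model, x, explanation,
                              n_trials=50, n_masked=3, rng=None):
    """Generic perturbation-based faithfulness score (illustrative sketch).

    model:       callable mapping a 1-D feature vector to a scalar output
    x:           1-D numpy array of input features
    explanation: 1-D numpy array of per-feature attribution scores
    """
    rng = np.random.default_rng(rng)
    base = model(x)
    attributed, deltas = [], []
    for _ in range(n_trials):
        # Randomly pick a subset of features and perturb them
        # (here: mask to zero, a common but task-dependent baseline).
        idx = rng.choice(x.size, size=n_masked, replace=False)
        x_pert = x.copy()
        x_pert[idx] = 0.0
        attributed.append(explanation[idx].sum())
        deltas.append(abs(base - model(x_pert)))
    # A faithful explanation assigns high attribution to features whose
    # removal changes the output the most, so the two lists should correlate.
    return np.corrcoef(attributed, deltas)[0, 1]
```

Evaluators of this kind depend on a perturbation baseline and, often, a threshold for what counts as "faithful"; the abstract's point is that such choices ignore the learned geometry of the model and explanation spaces, which the proposed threshold-free evaluator is designed to address.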
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Colin_Raffel1
Submission Number: 3341