Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions

Anna Hedström; Philine Lou Bommer; Thomas F Burns; Sebastian Lapuschkin; Wojciech Samek; Marina MC Höhne

Published: 10 Feb 2025, Last Modified: 10 Feb 2025, Accepted by TMLR, License: CC BY 4.0

Abstract: Interpretability researchers face a universal question: without access to ground-truth labels, how can the faithfulness of an explanation to its model be determined? Despite immense efforts to develop new evaluation methods, current approaches remain in a pre-paradigmatic state: fragmented, difficult to calibrate, and lacking cohesive theoretical grounding. Observing the lack of a unifying theory, we propose a novel evaluative criterion entitled Generalised Explanation Faithfulness (GEF), which is centered on explanation-to-model alignment and integrates existing perturbation-based evaluations to eliminate the need for singular, task-specific evaluations. Complementing this unifying perspective, we reveal, from a geometric point of view, a prevalent yet critical oversight in current evaluation practice: the failure to account for the learned geometry and non-linear mapping present in the model and explanation spaces. To solve this, we propose a general-purpose, threshold-free faithfulness evaluator, GEF, that incorporates principles from differential geometry and facilitates evaluation agnostically across tasks and interpretability approaches. Through extensive cross-domain benchmarks on natural language processing, vision, and tabular tasks, we provide first-of-its-kind insights into the comparative performance of various interpretable methods, including local linear approximators, global feature visualisation methods, large language models as post-hoc explainers, and sparse autoencoders. Our contributions are important to the interpretability and AI safety communities, offering a principled, unified approach for evaluation.

Certifications: Survey Certification

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission:
- Uploading camera-ready version
- Fixing typos, adding code links, polishing figures
- Changing the title
- Adding an acknowledgment

Code: https://github.com/annahedstroem/GEF/

Supplementary Material: zip

Assigned Action Editor: ~Colin_Raffel1

Submission Number: 3341
