Keywords: explainability, formal explainability, guaranteed explainability
TL;DR: We introduce the notion of fixed point explanation to formally characterise and study the interplay between a model and its explainer.
Abstract: This paper introduces a formal notion of fixed point explanations, inspired by the “why regress” principle, to assess the stability of the interplay between a model and its explainer under recursive application. Fixed point explanations satisfy properties such as minimality, stability, and faithfulness, and they reveal hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based methods to mechanistic tools such as Sparse AutoEncoders, and we report quantitative and qualitative results on several datasets and models, including LLMs such as Llama-3.3-70B.
Primary Area: interpretability and explainable AI
Submission Number: 11979