Keywords: mechanistic interpretability, conceptual alignment, explanations
Abstract: Interpretability is the process of explaining neural networks in a human-understandable way. A good explanation has three core components: it is (1) faithful to the explained model, (2) understandable to the interpreter, and (3) effectively communicated. We argue that current mechanistic interpretability methods focus primarily on faithfulness and could improve by additionally considering the human interpreter and the communication process. We propose and analyse two approaches to \emph{Concept Enrichment} for the human interpreter, \emph{Pre-Explanation Learning} and \emph{Mechanistic Socratic Explanation}, which use the AI's representations to teach the interpreter novel and useful concepts. We reframe the Interpretability Problem as a Bidirectional Communication Problem between the model and the interpreter, highlighting interpretability's pedagogical aspects. We suggest that Concept Enrichment may be a key way to aid Conceptual Alignment between AIs and humans for improved mutual understanding.
Submission Type: Short Paper (4 Pages)
Archival Option: This is a non-archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 77