Keywords: mechanistic interpretability, conceptual alignment, explanations
Abstract: Interpretability is the process of explaining neural networks in a human-understandable way. A good explanation has three core components: it is (1) faithful to the explained model, (2) understandable to the interpreter, and (3) effectively communicated. We argue that current mechanistic interpretability methods focus primarily on faithfulness and could improve by additionally considering the human interpreter and the communication process. We propose and analyse two approaches to \emph{Concept Enrichment} for the human interpreter, \emph{Pre-Explanation Learning} and \emph{Mechanistic Socratic Explanation}, which use the AI's representations to teach the interpreter novel and useful concepts. We reframe the Interpretability Problem as a Bidirectional Communication Problem between the model and the interpreter, highlighting interpretability's pedagogical aspects. We suggest that Concept Enrichment may be a key way to aid Conceptual Alignment between AIs and humans for improved mutual understanding.
Submission Type: Short Paper (4 Pages)
Archival Option: This is a non-archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 77