Keywords: mechanistic interpretability
TL;DR: We introduce interpretive equivalence as a way to formally compare different mechanistic model explanations
Abstract: Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: the lack of a well-defined notion of a valid interpretation, and the ad hoc nature of generating and searching for such explanations. In this paper, we address these challenges by formally defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are (approximately) equivalent if and only if all of their possible implementations are also (approximately) equivalent. We develop tractable algorithms to estimate interpretive equivalence and demonstrate their use in case studies on Transformer-based models. To analyze our algorithms, we introduce necessary and sufficient conditions for interpretive equivalence grounded in the similarity of the models' neural representations. As a result, we provide the first theoretical guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods for MI and automated, generalizable interpretation discovery methods.
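As a rough illustration of the stated principle (not taken from the paper; the symbols $\mathrm{Impl}$, $d$, and $\varepsilon$ are hypothetical notation), the equivalence criterion could be sketched as:

$$
I_1 \approx_{\varepsilon} I_2 \;\iff\; \forall\, f_1 \in \mathrm{Impl}(I_1),\ \forall\, f_2 \in \mathrm{Impl}(I_2):\ d(f_1, f_2) \le \varepsilon,
$$

where $\mathrm{Impl}(I)$ would denote the set of models implementing interpretation $I$ and $d$ some distance over model behaviors; the paper's actual formalization may differ.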
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 18899