MIB: A Mechanistic Interpretability Benchmark

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY 4.0
TL;DR: We propose a benchmark to establish lasting standards for comparing causal localization methods.
Abstract: How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark, with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or causal variables in neural language models. The circuit localization track compares methods that locate the model components, and the connections between them, that are most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and align those features to a task-relevant causal variable. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are no better than neurons, i.e., non-featurized hidden vectors. These findings illustrate that MIB enables meaningful comparisons and increase our confidence that there has been real progress in the field.
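
To make the circuit localization track more concrete, below is a minimal, self-contained sketch of attribution patching, the first-order approximation to activation patching cited in the abstract as one example method. It uses a toy PyTorch model; the names here (TinyLM, score_fn, the chosen hook point) are illustrative assumptions and not the benchmark's actual models or API (see the code link below for the official implementation).

```python
# Minimal sketch of attribution patching: approximate the effect of patching a
# clean activation into a corrupted run with a first-order Taylor expansion,
#   effect_i ~= (a_clean_i - a_corrupt_i) * d(score)/d(a_corrupt_i).
# All module and function names are hypothetical stand-ins for illustration.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy two-layer network standing in for a transformer block."""
    def __init__(self, d=16, vocab=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.mlp = nn.Linear(d, d)
        self.unembed = nn.Linear(d, vocab)

    def forward(self, tokens, patch=None):
        h = self.embed(tokens).mean(dim=1)   # [batch, d]
        a = self.mlp(h)                      # hook point: the "component" being localized
        if patch is not None:
            a = patch                        # true activation patching would overwrite here
        return self.unembed(a)

def attribution_patching_scores(model, clean_toks, corrupt_toks, score_fn):
    """Per-dimension attribution scores at the MLP hook point."""
    # Cache the clean-run activation (no gradients needed for it).
    with torch.no_grad():
        a_clean = model.mlp(model.embed(clean_toks).mean(dim=1))

    # Corrupted forward pass, keeping gradients at the hook point.
    a_corrupt = model.mlp(model.embed(corrupt_toks).mean(dim=1))
    a_corrupt.retain_grad()
    logits = model.unembed(a_corrupt)
    score_fn(logits).backward()

    # Linear approximation of the patching effect, per hidden dimension.
    return (a_clean - a_corrupt) * a_corrupt.grad

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyLM()
    clean = torch.randint(0, 32, (4, 5))
    corrupt = torch.randint(0, 32, (4, 5))
    score = lambda logits: logits[:, 0].sum()  # hypothetical task metric
    print(attribution_patching_scores(model, clean, corrupt, score).shape)  # [4, 16]
```

In the benchmark setting, scores like these would be computed for every candidate component (attention heads, MLPs, and their connections) and thresholded to select a circuit; this sketch shows only the scoring step for a single hook point.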
Lay Summary: A primary goal of mechanistic interpretability research is to understand how large language models make decisions by finding the parts inside these models that handle specific tasks, and then figuring out what those parts do. Many different methods have been developed to locate these important components, but until now, there was no standard way to compare how well these methods actually work. In this paper, we propose MIB (a Mechanistic Interpretability Benchmark), a large-scale dataset for evaluating and comparing mechanistic interpretability methods. The benchmark has two parts: one checks whether a method can find all the important components for a specific task (regardless of what they do); the other tests whether, given a concept, a method can locate where that concept is computed in the model. Using this benchmark, we find that some approaches (generally newer ones) are consistently more effective across different models and tasks. We also recover known findings, which serves as a sanity check. This provides evidence that the field is making real progress in understanding which components do what in language models.
Link To Code: https://github.com/aaronmueller/MIB
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: interpretability, causality, localization, benchmarking, circuits
Submission Number: 13210