InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Published: 24 Jun 2024 · Last Modified: 16 Jul 2024 · ICML 2024 MI Workshop Poster · CC BY 4.0
Keywords: Benchmark, Mechanistic Interpretability, Transformers, Circuit Discovery
TL;DR: We present a benchmark of transformers with known circuits for evaluating mechanistic interpretability techniques.
Abstract: Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when we don't know the algorithm in the first place. This work presents InterpBench, a benchmark of semi-synthetic but realistic transformers with known circuits, generated to evaluate such techniques. We propose an approach called Strict Interchange Intervention Training (SIIT) to create these semi-synthetic models. Like plain Interchange Intervention Training (IIT), SIIT trains neural networks to be aligned with high-level causal models, but improves on IIT by also preventing non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. Finally, we use our benchmark to evaluate existing mechanistic interpretability techniques.
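The interchange intervention at the heart of IIT and SIIT can be illustrated with a minimal sketch: run a low-level model on a "base" input, but overwrite one intermediate node's activation with the value it takes on a "source" input, and check that the result matches the high-level causal model under the same swap. The toy model, node name, and helper functions below are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch of an interchange intervention (the core operation in
# IIT/SIIT). The toy model and the node name "sum" are assumptions made
# for illustration only.

def run_model(a, b, c, patch=None):
    """Toy 'low-level model' computing (a + b) * c.

    `patch` optionally overrides the intermediate node "sum",
    mimicking an activation patch in a real transformer.
    """
    s = a + b
    if patch is not None and "sum" in patch:
        s = patch["sum"]  # interchange: substitute the source activation
    return s * c

def interchange_intervention(base, source):
    """Run the model on `base` inputs, with the "sum" node's activation
    taken from a forward pass on `source` inputs."""
    src_sum = source[0] + source[1]  # cache the source activation
    return run_model(*base, patch={"sum": src_sum})

# Alignment check: the patched low-level output should equal the
# high-level causal model evaluated with the same node swapped in.
base, source = (1, 2, 10), (4, 5, 7)
low = interchange_intervention(base, source)
high = (source[0] + source[1]) * base[2]  # high-level model, node swapped
assert low == high
print(low)  # → 90
```

SIIT's addition, in these terms, would be an extra training constraint: performing the same patch on a node *outside* the designated circuit must leave the model's output unchanged, so that non-circuit components verifiably carry no causal influence.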
Submission Number: 110