When Circuits Are Too Broad: Unit Tests for Mechanistic Interpretability

Muhammet Anil Yagiz

Published: 11 Jun 2026, Last Modified: 11 Jun 2026

Venue: Mech Interp Workshop ICML 2026 Virtual (poster)

License: CC BY 4.0

Keywords: Methods (probing, steering, causal interventions), Benchmarking Interpretability

TL;DR: Mechanistic interpretability claims should pass negative-control unit tests, not only show that an intervention moves the target behavior.

Abstract: Mechanistic interpretability claims often show that intervening on a circuit, feature, or representation changes a target behavior. Such target effects are necessary but insufficient: broad, lexical, confounded, or template-fragile interventions can produce the same evidence. We propose mechanistic unit tests, a negative-control protocol for evaluating the specificity of circuit and feature claims. The protocol asks whether a proposed mechanism survives nuisance rewrites, fails on matched negatives, avoids off-target damage, and dominates cheap same-budget baselines. We summarize these trade-offs with specificity frontiers, plotting target effect against off-target damage across intervention strengths. A controlled case study and a small distilgpt2 pilot show how target-only evidence can hide lexical and negation failures. The contribution is a falsification layer for mechanistic claims, not a new discovery method.
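
Illustrative sketch: the four checks and the specificity frontier described in the abstract could be expressed roughly as follows. This is not the author's implementation. The names `Measurement`, `specificity_frontier`, and `passes_unit_tests`, the tolerance `tol`, and the scalar scores are all hypothetical, and reducing the frontier to non-dominated points (a Pareto filter) is an assumption about how the trade-off might be summarized.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    strength: float           # intervention strength (hypothetical scale)
    target_effect: float      # change in the target behavior; higher is better
    off_target_damage: float  # degradation on unrelated behavior; lower is better


def specificity_frontier(measurements):
    """Keep the measurements that no other measurement dominates.

    Assumption: one point dominates another if it has at least as much
    target effect, at most as much off-target damage, and is strictly
    better on one of the two.
    """
    frontier = []
    for m in measurements:
        dominated = any(
            o.target_effect >= m.target_effect
            and o.off_target_damage <= m.off_target_damage
            and (
                o.target_effect > m.target_effect
                or o.off_target_damage < m.off_target_damage
            )
            for o in measurements
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.off_target_damage)


def passes_unit_tests(target, rewrite, negative, off_target, baseline, tol=0.1):
    """Apply hypothetical pass/fail thresholds to the four checks.

    All arguments are scalar effect sizes measured under the same
    intervention; `tol` is an arbitrary illustrative threshold.
    """
    return {
        "survives_nuisance_rewrites": rewrite >= target - tol,
        "fails_on_matched_negatives": negative <= tol,
        "avoids_off_target_damage": off_target <= tol,
        "dominates_cheap_baseline": target > baseline,
    }


# Made-up numbers for illustration only.
sweep = [
    Measurement(0.5, 0.2, 0.0),
    Measurement(1.0, 0.6, 0.1),
    Measurement(2.0, 0.6, 0.5),
]
frontier = specificity_frontier(sweep)
report = passes_unit_tests(
    target=0.6, rewrite=0.55, negative=0.4, off_target=0.05, baseline=0.3
)
```

In this toy sweep the strongest intervention adds off-target damage without adding target effect, so it drops off the frontier. The report flags a matched-negative failure even though the target effect looks strong, which is the kind of case target-only evidence would hide.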

Submission Number: 134
