Keywords: Methods (probing, steering, causal interventions), Benchmarking Interpretability
TL;DR: Mechanistic interpretability claims should pass negative-control unit tests, not only show that an intervention moves the target behavior.
Abstract: Mechanistic interpretability claims often show that intervening on a circuit, feature, or representation changes a target behavior. Such target effects are necessary but insufficient: broad, lexical, confounded, or template-fragile interventions can produce the same evidence. We propose mechanistic unit tests, a negative-control protocol for evaluating the specificity of circuit and feature claims. The protocol asks whether a proposed mechanism survives nuisance rewrites, fails on matched negatives, avoids off-target damage, and dominates cheap same-budget baselines. We summarize these trade-offs with specificity frontiers, plotting target effect against off-target damage across intervention strengths. A controlled case study and a small distilgpt2 pilot show how target-only evidence can hide lexical and negation failures. The contribution is a falsification layer for mechanistic claims, not a new discovery method.
Submission Number: 134
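The four checks and the specificity frontier named in the abstract can be sketched in code. What follows is a minimal illustration under stated assumptions, not the authors' implementation: the names `Measurement`, `specificity_frontier`, and `mechanistic_unit_tests`, the dictionary keys, and the thresholds `min_retained`, `max_negative_ratio`, and `max_damage` are all hypothetical, and the numbers are made-up toy values rather than results from the paper. The frontier is taken here to be the Pareto-optimal set of (target effect, off-target damage) points across intervention strengths, which is one plausible reading of the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    strength: float           # intervention strength (hypothetical units)
    target_effect: float      # effect on the target behavior, higher is better
    off_target_damage: float  # degradation on unrelated inputs, lower is better


def specificity_frontier(points):
    """Keep points not dominated by any other point.

    A point is dominated if another point has at least as much target effect
    and at most as much off-target damage, with one of the two strictly better.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.target_effect >= p.target_effect
            and q.off_target_damage <= p.off_target_damage
            and (q.target_effect > p.target_effect
                 or q.off_target_damage < p.off_target_damage)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda m: m.off_target_damage)


def mechanistic_unit_tests(effects, min_retained=0.5,
                           max_negative_ratio=0.25, max_damage=0.1):
    """Run the four negative-control checks on pre-computed effect sizes.

    All thresholds are assumptions chosen for illustration only.
    """
    target = effects["target"]
    return {
        # The effect should persist under nuisance rewrites of the prompt.
        "survives_nuisance_rewrites":
            effects["nuisance_rewrite"] >= min_retained * target,
        # The effect should largely vanish on matched negatives.
        "fails_on_matched_negatives":
            effects["matched_negative"] <= max_negative_ratio * target,
        # Unrelated behavior should be mostly preserved.
        "avoids_off_target_damage":
            effects["off_target_damage"] <= max_damage,
        # The mechanism should beat a cheap baseline at the same budget.
        "dominates_baseline": target > effects["baseline"],
    }


# Toy numbers, invented purely to exercise the functions.
sweep = [
    Measurement(0.5, 0.20, 0.01),
    Measurement(1.0, 0.60, 0.05),
    Measurement(2.0, 0.55, 0.20),
    Measurement(4.0, 0.90, 0.50),
]
toy_effects = {
    "target": 0.60,
    "nuisance_rewrite": 0.50,
    "matched_negative": 0.40,
    "off_target_damage": 0.05,
    "baseline": 0.30,
}

frontier = specificity_frontier(sweep)
report = mechanistic_unit_tests(toy_effects)
print([m.strength for m in frontier])
print(report)
```

In this toy run the intervention has a sizeable target effect and beats the baseline, yet it fails the matched-negative check, which is the kind of failure that target-only evidence would not reveal.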