From Representation to Causation: A Three-Tier Framework and Open-Source Benchmark for Mechanistic Interpretability
Abstract: Interpretability research often conflates whether information is merely encoded within a model with whether it causally drives behavior. We introduce MechInterp3, a failure-aware framework that disentangles these properties into a three-tier hierarchy: (Tier-1) linear encoding, (Tier-2) probe accessibility, and (Tier-3) causal responsibility. Applying this framework to six transformer architectures across four tasks, we find that standard causal interventions "silently fail" in approximately 50% of model-task combinations due to weak behavioral contrast, producing mathematically ill-conditioned estimates that undermine causal claims. Our systematic evaluation yields three critical findings. First, we identify a pervasive tier dissociation: models with near-perfect probe accuracy often show zero or negative causal recovery, most notably in GPT-2 sentiment processing (−0.34 recovery). Second, we demonstrate that observational methods such as attention weights and gradient attribution are uninformative about causal structure, showing near-zero correlation ($\rho < 0.1$) with intervention effects. Third, we find that tasks requiring relational reasoning, such as natural language inference (NLI), induce more stable and localized causal circuits than surface-level tasks, despite having weaker linear representations. We release MechInterp3 as an open-source library to establish a rigorous statistical foundation for mechanistic interpretability.
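To make the tier dissociation and the "silent failure" mode concrete, the sketch below contrasts Tier-2 probe accessibility with Tier-3 causal responsibility on a toy linear model. It is a minimal illustration of the general ideas in the abstract, not the MechInterp3 API: all names (`behavior`, the normalized recovery formula, the ill-conditioning guard) are hypothetical assumptions introduced here.

```python
# Hypothetical sketch: a feature can be perfectly probe-accessible (Tier-2)
# while carrying no causal responsibility for behavior (Tier-3).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy setup: the label is linearly encoded in dimension 0 of the activations,
# but the downstream readout depends only on dimension 1.
n, d = 2000, 32
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d))
acts[:, 0] += 3.0 * (2 * labels - 1)  # label is linearly encoded in dim 0

readout = np.zeros(d)
readout[1] = 1.0                      # behavior ignores the label direction

def behavior(a):
    """Toy downstream behavior: the scalar logit the 'model' acts on."""
    return a @ readout

# Tier-2: a linear probe recovers the label almost perfectly.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print(f"probe accuracy: {probe.score(acts, labels):.2f}")  # ~1.0

# Tier-3: ablate then restore the label-encoding dimension and measure
#   recovery = (patched - corrupted) / (clean - corrupted),
# a normalization analogous in spirit to a causal-recovery score.
clean = behavior(acts).mean()
corrupted_acts = acts.copy()
corrupted_acts[:, 0] = rng.permutation(corrupted_acts[:, 0])  # ablate dim 0
corrupted = behavior(corrupted_acts).mean()
patched_acts = corrupted_acts.copy()
patched_acts[:, 0] = acts[:, 0]       # restore dim 0 only
patched = behavior(patched_acts).mean()

denom = clean - corrupted
if abs(denom) < 1e-6:
    # Weak behavioral contrast: the denominator vanishes, the estimate is
    # ill-conditioned, and the intervention "silently fails".
    print("recovery undefined: clean and corrupted behavior barely differ")
else:
    print(f"causal recovery: {(patched - corrupted) / denom:.2f}")
```

In this toy the probe scores near 1.0 while the intervention produces no behavioral contrast at all, so the recovery ratio is undefined rather than informative; an explicit denominator check of this kind is one way to surface such failures instead of reporting an arbitrary number.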
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Francesco_Locatello1
Submission Number: 7184