From Representation to Causation: A Three-Tier Framework and Open-Source Benchmark for Mechanistic Interpretability
Abstract: Interpretability research often conflates whether information is merely encoded within a model with whether it causally drives behavior. We introduce MechInterp3, a failure-aware framework that disentangles these properties into a three-tier hierarchy: (Tier-1) linear encoding, (Tier-2) probe accessibility, and (Tier-3) causal responsibility. Applying this framework to six transformer architectures across four tasks, we find that standard causal interventions "silently fail" in approximately 50% of model-task combinations due to weak behavioral contrast, producing mathematically ill-conditioned estimates that undermine causal claims. Our systematic evaluation yields three critical findings. First, we identify a pervasive tier dissociation: models with near-perfect probe accuracy often show zero or negative causal recovery, most notably in GPT-2 sentiment processing (−0.34 recovery). Second, we demonstrate that observational methods such as attention weights and gradient attribution are uninformative about causal structure, showing near-zero correlation ($\rho < 0.1$) with intervention effects. Third, we find that tasks requiring relational reasoning, such as natural language inference (NLI), induce more stable and localized causal circuits than surface-level tasks, despite having weaker linear representations. We release MechInterp3 as an open-source library to establish a rigorous statistical foundation for mechanistic interpretability.
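To make the tier dissociation and the "silent failure" mode concrete, the sketch below contrasts Tier-2 probe accessibility with Tier-3 causal responsibility on a toy linear model. It is a minimal illustration of the general ideas in the abstract, not the MechInterp3 API: all names (`behavior`, the normalized recovery formula, the ill-conditioning guard) are hypothetical assumptions introduced here.

```python
# Hypothetical sketch: a feature can be perfectly probe-accessible (Tier-2)
# while carrying no causal responsibility for behavior (Tier-3).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy setup: the label is linearly encoded in dimension 0 of the activations,
# but the downstream readout depends only on dimension 1.
n, d = 2000, 32
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d))
acts[:, 0] += 3.0 * (2 * labels - 1)  # label is linearly encoded in dim 0

readout = np.zeros(d)
readout[1] = 1.0                      # behavior ignores the label direction

def behavior(a):
    """Toy downstream behavior: the scalar logit the 'model' acts on."""
    return a @ readout

# Tier-2: a linear probe recovers the label almost perfectly.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print(f"probe accuracy: {probe.score(acts, labels):.2f}")  # ~1.0

# Tier-3: ablate then restore the label-encoding dimension and measure
#   recovery = (patched - corrupted) / (clean - corrupted),
# a normalization analogous in spirit to a causal-recovery score.
clean = behavior(acts).mean()
corrupted_acts = acts.copy()
corrupted_acts[:, 0] = rng.permutation(corrupted_acts[:, 0])  # ablate dim 0
corrupted = behavior(corrupted_acts).mean()
patched_acts = corrupted_acts.copy()
patched_acts[:, 0] = acts[:, 0]       # restore dim 0 only
patched = behavior(patched_acts).mean()

denom = clean - corrupted
if abs(denom) < 1e-6:
    # Weak behavioral contrast: the denominator vanishes, the estimate is
    # ill-conditioned, and the intervention "silently fails".
    print("recovery undefined: clean and corrupted behavior barely differ")
else:
    print(f"causal recovery: {(patched - corrupted) / denom:.2f}")
```

In this toy the probe scores near 1.0 while the intervention produces no behavioral contrast at all, so the recovery ratio is undefined rather than informative; an explicit denominator check of this kind is one way to surface such failures instead of reporting an arbitrary number.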
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Francesco_Locatello1
Submission Number: 7184