Keywords: Circuit analysis, Causal interventions, Benchmarking interpretability
Other Keywords: robustness, analysis
TL;DR: Circuit discovery methods should be treated as statistical estimators. We analyze the stability of EAP-IG and show that it can exhibit high structural variance and sensitivity to hyperparameters. We propose best-practice recommendations to promote a more rigorous science of AI interpretability.
Abstract: Developing trustworthy artificial intelligence requires moving beyond black-box performance metrics toward understanding models' internal computations. Mechanistic Interpretability (MI) addresses this by identifying the algorithmic mechanisms underlying model behaviors, yet its scientific rigor critically depends on the reliability of its findings. In this work, we argue that interpretability methods such as circuit discovery should be viewed as statistical estimators, subject to questions of variance and robustness. To illustrate this statistical framing, we present a systematic stability analysis of a state-of-the-art circuit discovery method: EAP-IG. We evaluate its variance and robustness through a comprehensive suite of controlled perturbations, including input resampling, prompt paraphrasing, hyperparameter variation, and injected noise within the causal analysis itself. Across various models and tasks, our results demonstrate that EAP-IG can exhibit high structural variance and sensitivity to hyperparameters, calling into question the stability of its findings. Based on these results, we offer a set of best-practice recommendations for the field, advocating for the routine reporting of stability metrics to promote a more rigorous and statistically grounded science of interpretability.
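To make the recommendation of reporting stability metrics concrete, the sketch below shows one plausible way to quantify the structural variance of discovered circuits: computing pairwise Jaccard overlap between the edge sets returned by repeated circuit-discovery runs (e.g., under input resampling or different hyperparameters). The function names, edge identifiers, and the choice of metric are illustrative assumptions, not the paper's actual implementation; EAP-IG itself is not implemented here.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two circuit edge sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def circuit_stability(edge_sets: list[set]) -> dict:
    """Summarize structural stability across repeated circuit-discovery runs.

    edge_sets: one set of (source, target) edge identifiers per run,
    e.g., one run per input resample, paraphrase, or hyperparameter setting.
    """
    pairwise = [jaccard(a, b) for a, b in combinations(edge_sets, 2)]
    core = set.intersection(*edge_sets)   # edges recovered in every run
    union = set.union(*edge_sets)         # edges recovered in any run
    return {
        "mean_pairwise_jaccard": sum(pairwise) / len(pairwise),
        "min_pairwise_jaccard": min(pairwise),
        "core_edge_fraction": len(core) / len(union),
    }

# Hypothetical example: three runs of a circuit discovery method on resampled inputs
runs = [
    {("embed", "mlp.3"), ("mlp.3", "attn.5.h2"), ("attn.5.h2", "logits")},
    {("mlp.3", "attn.5.h2"), ("attn.5.h2", "logits"), ("embed", "attn.1.h0")},
    {("embed", "mlp.3"), ("mlp.3", "attn.5.h2"), ("embed", "attn.1.h0")},
]
print(circuit_stability(runs))
```

A mean pairwise Jaccard near 1 would indicate that runs recover essentially the same circuit, while low values or a small core-edge fraction would signal the kind of structural variance the abstract describes.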
Submission Number: 118