Keywords: Interpretability, Explanation, Inference
TL;DR: Argues that AI interpretability needs to be reframed as a statistical inference problem
Abstract: A striking neuroscience study once placed a dead salmon in an fMRI scanner and showed it images of humans in social situations. Astonishingly, standard analyses reported brain regions predictive of social emotions. The explanation, of course, was not supernatural cognition but a cautionary tale about misapplied statistical inference.
In AI interpretability, reports of similar "dead salmon" artifacts abound: feature attribution, probing, sparse auto-encoding, and even causal analyses can produce plausible-looking explanations for randomly initialized neural networks.
In this work, we argue for a fundamental statistical-causal reframing: explanations of computational systems should be treated as parameters of a (statistical) model, inferred from computational traces. This perspective goes beyond simply measuring the statistical variability of explanations due to finite sampling of input data: interpretability methods become statistical estimators, and findings should be tested against explicit and meaningful alternative computational hypotheses. It also highlights important theoretical issues, such as the identifiability of explanations, which we argue is critical for understanding the field's susceptibility to false discoveries. We illustrate this reframing with a toy scenario recasting probing as hypothesis testing against null distributions derived from random computation. The statistical-causal perspective opens many avenues for future work aiming to turn AI interpretability into a rigorous science.
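The sketch below illustrates one way the toy scenario could look in practice; it is not the paper's implementation. It assumes hypothetical inputs: `trained_acts` (activations from the model under study), `labels` (the probed property), and `random_acts_fn` (a user-supplied function returning activations from a randomly initialized network for a given seed). A linear probe's accuracy on the trained model is compared against a null distribution of probe accuracies obtained from random computation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def probe_accuracy(activations, labels, seed=0):
    """Cross-validated accuracy of a linear probe on a set of activations."""
    clf = LogisticRegression(max_iter=1000, random_state=seed)
    return cross_val_score(clf, activations, labels, cv=5).mean()


def probing_p_value(trained_acts, labels, random_acts_fn, n_null=100):
    """Test a probing result against a null of randomly initialized networks.

    random_acts_fn(seed) is a hypothetical helper that returns activations
    from a freshly random-initialized ("dead salmon") network on the same inputs.
    """
    observed = probe_accuracy(trained_acts, labels)
    null = np.array([
        probe_accuracy(random_acts_fn(seed), labels) for seed in range(n_null)
    ])
    # One-sided p-value: how often does random computation probe at least as well?
    p_value = (1 + np.sum(null >= observed)) / (1 + n_null)
    return observed, null, p_value
```

Under this framing, a high probe accuracy alone is not a finding; it only becomes evidence for the explanation if it is unlikely under the explicit null of random computation.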
Primary Area: interpretability and explainable AI
Submission Number: 17670