Building Fast, Evaluating Slow: Pipeline Choices Dominate Autointerpretability Score Variance

Sinie van der Ben; Neele Roch; Anna Hedström; Mennatallah El-Assady

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Venue: Mech Interp Workshop ICML 2026 (Poster). License: CC BY 4.0

Keywords: Benchmarking Interpretability, Automated interpretability

Other Keywords: Autointerpretability; Automated interpretability; Sparse autoencoders; Sparse autoencoder evaluation; Evaluation reliability

TL;DR: Autointerpretability scores, the dominant instrument for SAE evaluation, are more sensitive to pipeline choices such as corpus, sampling, and explainer model than to the architectural differences they are meant to capture.

Abstract: Cross-paper comparison of sparse autoencoder (SAE) interpretability often relies on autointerpretability scores. In this evaluation pipeline, a language model (LM) explains each feature, and another LM scores the explanation. For these comparisons to be meaningful, scores must reflect stable properties of the features rather than confounding aspects of the evaluation pipeline. Through systematic experiments across four metrics (simulation, detection, fuzzing, purity), two models (Pythia-160M, Apertus-8B), and four axes of methodological variation, we show that this assumption does not hold. Specifically, we find that (R1) methodological variance collectively exceeds architectural variance across all metrics and tested models; (R2) each metric exhibits a distinct instability profile, with detection being the most stable and fuzzing unreliable across all conditions; and (R3) top-$k$ feature rankings are not consistent across corpus and draw conditions, so stable mean scores mask per-feature instability, a failure that cannot be detected by monitoring explanation similarity alone. These findings suggest that cross-paper comparisons based on autointerpretability scores may reflect pipeline differences rather than architectural differences, with implications for the ongoing debate on SAE utility. More broadly, unreliable evaluation slows progress in interpretability research at a time when reliable tools for understanding AI systems are needed. To support more reliable evaluation, we contribute a variance decomposition approach, a Stability Check, and a Minimum Reporting Checklist.
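
As a rough illustration of the two quantities the abstract contrasts, the sketch below compares between-level variance for an architecture factor against a pipeline factor, and measures top-$k$ overlap between two scoring runs. It is not the authors' variance decomposition or Stability Check. The function names (`variance_by_factor`, `top_k_overlap`), the factor labels, and every number are hypothetical toy values chosen for illustration.

```python
from statistics import mean, pvariance


def variance_by_factor(scores, factor_index):
    """Variance of the per-level mean scores for one factor.

    `scores` maps a tuple of factor levels, here (architecture, corpus),
    to a mean autointerpretability score. This is a simple between-level
    variance, assumed here as a stand-in for a full decomposition.
    """
    groups = {}
    for levels, value in scores.items():
        groups.setdefault(levels[factor_index], []).append(value)
    return pvariance([mean(values) for values in groups.values()])


def top_k_overlap(scores_a, scores_b, k):
    """Jaccard overlap between the top-k features of two scoring runs."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / len(top_a | top_b)


# Toy mean scores keyed by (architecture, corpus); values are invented.
toy_scores = {
    ("relu", "corpus_a"): 0.60,
    ("relu", "corpus_b"): 0.40,
    ("topk", "corpus_a"): 0.62,
    ("topk", "corpus_b"): 0.44,
}
architecture_variance = variance_by_factor(toy_scores, 0)
corpus_variance = variance_by_factor(toy_scores, 1)

# Toy per-feature scores from two runs with the same mean score
# but a different ordering of features; values are invented.
run_a = {"f1": 0.9, "f2": 0.8, "f3": 0.1, "f4": 0.5}
run_b = {"f1": 0.9, "f2": 0.1, "f3": 0.8, "f4": 0.5}
overlap = top_k_overlap(run_a, run_b, k=2)

print(f"architecture variance: {architecture_variance:.6f}")
print(f"corpus variance:       {corpus_variance:.6f}")
print(f"top-2 overlap:         {overlap:.3f}")
```

In this toy setup the two runs share the same mean score while their top-2 feature sets only partly agree, which is the kind of per-feature instability that a mean score alone would not reveal.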

Submission Number: 198
