Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers

ICLR 2026 Conference Submission 13642 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: Sparse Autoencoders, SAEs, LLMs, interpretability
TL;DR: SAEs trained on random transformers achieve automated interpretability scores similar to those of trained models, showing that more targeted measures are needed.
Abstract: Sparse autoencoders (SAEs) are widely used to extract sparse, interpretable latents from transformer activations. We test whether commonly used SAE quality metrics and automatic explanation pipelines can distinguish trained transformers from randomly initialized ones (e.g., where parameters are sampled i.i.d. from a Gaussian). Over a wide range of Pythia model sizes and multiple randomization schemes, we find that, in many settings, SAEs trained on randomly initialized transformers produce auto-interpretability scores and reconstruction metrics that are similar to those from trained models. These results show that high aggregate auto-interpretability scores do not, by themselves, guarantee that learned, computationally relevant features have been recovered. We therefore recommend treating common SAE metrics as useful but insufficient proxies for mechanistic interpretability and argue for routine randomized baselines and targeted measures of feature 'abstractness'.
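A minimal sketch of the kind of randomized baseline the abstract recommends, assuming the `transformers` and `torch` libraries; the model name, the i.i.d. Gaussian re-initialization, and the fraction-of-variance-explained reconstruction metric are illustrative choices, not the authors' exact protocol.

```python
# Sketch: build a randomly initialized copy of a Pythia model and define a
# common SAE reconstruction metric. Assumptions: model name and std are
# illustrative; the paper's actual randomization schemes may differ.
import torch
from transformers import AutoModelForCausalLM


def random_init_copy(model_name="EleutherAI/pythia-70m", std=0.02, seed=0):
    """Return a copy of the model with every parameter resampled i.i.d. from N(0, std^2)."""
    model = AutoModelForCausalLM.from_pretrained(model_name)
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in model.parameters():
            p.copy_(torch.randn(p.shape, generator=gen) * std)
    return model


def fraction_variance_explained(acts, recon):
    """Reconstruction quality: 1 - ||x - x_hat||^2 / ||x - mean(x)||^2 over a batch of activations."""
    resid = (acts - recon).pow(2).sum()
    total = (acts - acts.mean(dim=0)).pow(2).sum()
    return 1.0 - (resid / total).item()
```

One would then train an SAE on activations from both the trained and the re-initialized model and compare this metric (and auto-interpretability scores) across the two, per the baseline the abstract advocates.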
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 13642