Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Benchmarking Interpretability
TL;DR: We show that SAEs may not learn meaningful features, as random baselines match their performance on key interpretability and causal metrics.
Abstract: Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground‑truth features, we demonstrate that three out of four state‑of‑the‑art SAE architectures recover only 7–9% of true features despite achieving around 71% explained variance, showing that strong reconstruction alone is insufficient to guarantee meaningful feature recovery. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). These results show that current evaluation metrics are insufficient to certify that SAEs have learned meaningful features, and we offer our baselines as a reusable protocol for future SAE evaluation.
Submission Number: 423
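Illustrative sketch (not the authors' code): the abstract reports the fraction of ground-truth features an SAE recovers and compares trained SAEs against baselines with random feature directions. One common way to score recovery, assumed here because the abstract does not state the exact criterion, is to count a true feature as recovered when some learned decoder direction has cosine similarity above a threshold with it. The function names, the 0.9 threshold, and the dimensions below are all hypothetical.

```python
import numpy as np


def random_directions(n, d, seed=0):
    """Sample n unit-norm directions in R^d (a stand-in for a random-direction baseline)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n, d))
    return w / np.linalg.norm(w, axis=1, keepdims=True)


def recovery_rate(true_feats, learned_feats, threshold=0.9):
    """Fraction of ground-truth directions matched by at least one learned direction.

    A true feature counts as recovered if its maximum cosine similarity
    with any learned direction is >= threshold (the threshold is an assumption).
    """
    t = true_feats / np.linalg.norm(true_feats, axis=1, keepdims=True)
    l = learned_feats / np.linalg.norm(learned_feats, axis=1, keepdims=True)
    cos = t @ l.T                # shape: (n_true, n_learned)
    best = cos.max(axis=1)       # best match for each true feature
    return float((best >= threshold).mean())


# Synthetic setup: 64 known ground-truth features in a 128-dimensional space.
true_feats = random_directions(64, 128, seed=0)

# A dictionary of purely random directions.
random_dict = random_directions(256, 128, seed=1)

# A dictionary that contains 16 of the true features plus random filler.
partial_dict = np.vstack([true_feats[:16], random_directions(240, 128, seed=2)])

print("random dictionary :", recovery_rate(true_feats, random_dict))
print("partial dictionary:", recovery_rate(true_feats, partial_dict))
print("perfect dictionary:", recovery_rate(true_feats, true_feats))
```

Scoring a trained dictionary and a random one with the same function, on the same ground truth, is what makes the comparison meaningful: a metric on which the two score alike cannot by itself certify that meaningful features were learned.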