Null-Calibrated Evaluation of Sparse Autoencoder Decoder Reproducibility

Published: 25 May 2026, Last Modified: 25 May 2026 | CTB@ICML 2026 | Everyone | CC BY 4.0
Keywords: benchmark design, foundation model evaluation, sparse autoencoders, null calibration, random-dictionary baseline, multi-metric evaluation, decoder reproducibility, mechanistic interpretability, theory-benchmark loop, assignment-based matching, evaluation methodology, ground-truth recovery
TL;DR: A null-calibrated multi-metric benchmark for SAE decoder reproducibility: short-budget decoders sit within 1.5% of a geometric null while reconstruction looks converged, and functional diagnostics lag decoder agreement under longer training.
Abstract: Sparse autoencoders (SAEs) are often evaluated by reconstruction loss, but interpretability workflows also require that learned dictionaries be reproducible across random seeds and robust to evaluation artifacts. We study SAE decoder reproducibility as a benchmark-design problem: every stability score is reported against a metric-specific random-dictionary null, pairwise seed statistics are treated as dependent, and decoder geometry is audited with assignment-based, activation-level, firing-overlap, causal, streaming, and synthetic-ground-truth controls. In compute-limited cached-activation regimes, reconstruction can appear converged while decoder-column similarity remains within 1.5% of the geometric null; longer training raises decoder agreement, but activation and functional diagnostics lag. These results argue that SAE benchmarks should report reconstruction, null-calibrated decoder matching, held-out activation agreement, and ground-truth or downstream checks together rather than treating reconstruction or a single stability metric as sufficient.
Paper Type: Short (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 141
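
Illustrative sketch (not from the submission): the abstract describes reporting each stability score against a metric-specific random-dictionary null, with assignment-based matching of decoder columns. One way this could look is shown below. Everything here is an assumption made for illustration: the function names, the use of signed cosine similarity, the Gaussian random dictionaries as the null, and the "relative excess" summary are hypothetical and may differ from the authors' actual protocol.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def unit_columns(W):
    """Normalize each decoder column (one dictionary atom) to unit length."""
    return W / np.linalg.norm(W, axis=0, keepdims=True)


def matched_cosine(W_a, W_b):
    """Mean cosine similarity under the best one-to-one column matching.

    W_a, W_b: arrays of shape (d_model, n_latents), one column per latent.
    Signed cosine is an assumption here; a sign-invariant variant would
    take the absolute value of `sim` before matching.
    """
    sim = unit_columns(W_a).T @ unit_columns(W_b)
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())


def random_dictionary_null(d_model, n_latents, n_draws=20, seed=0):
    """Score pairs of random Gaussian dictionaries of the same shape.

    Returns the mean and standard deviation of the matched cosine, which
    is what two unrelated dictionaries achieve through geometry alone.
    """
    rng = np.random.default_rng(seed)
    scores = [
        matched_cosine(
            rng.standard_normal((d_model, n_latents)),
            rng.standard_normal((d_model, n_latents)),
        )
        for _ in range(n_draws)
    ]
    return float(np.mean(scores)), float(np.std(scores))


def null_calibrated_score(W_a, W_b, n_draws=20, seed=0):
    """Report the observed matched cosine next to its null baseline."""
    d_model, n_latents = W_a.shape
    observed = matched_cosine(W_a, W_b)
    null_mean, null_std = random_dictionary_null(d_model, n_latents, n_draws, seed)
    return {
        "observed": observed,
        "null_mean": null_mean,
        "null_std": null_std,
        # Fraction by which the observed score exceeds the null mean.
        "relative_excess": (observed - null_mean) / null_mean,
    }
```

Under these assumptions, a pair of decoders from two seeds would be passed as `null_calibrated_score(W_seed0, W_seed1)`. A `relative_excess` near zero means the pair is no more similar than two random dictionaries of the same shape, regardless of how high the raw matched cosine looks.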