Diagnosing and Fixing Latent Recovery in Sparse Autoencoders

Published: 01 Mar 2026 · Last Modified: 01 Mar 2026 · UCRL@ICLR2026 Oral · CC BY 4.0
Keywords: Sparse Autoencoder
Abstract: Sparse autoencoders (SAEs) have recently seen rapidly increasing use as a tool for interpreting representations in large models. Despite their widespread adoption, SAE training objectives focus primarily on accurately reconstructing observed data, implicitly assuming that perfect reconstruction or dictionary recovery implies recovery of the underlying latent variables. We show that even in the ideal case of exact observation reconstruction and correct dictionary recovery, recovery of the latent variables (concepts) is not guaranteed. We develop a unified theoretical analysis of latent recovery in SAEs, deriving complementary upper and lower bounds on the latent recovery error. The upper bound characterizes the error induced by dictionary coherence and sparsity, while the lower bound reveals an intrinsic source of error arising from unstable latent encoder–decoder dynamics. Motivated by this lower bound, we introduce a simple latent self-consistency regularizer that can be applied off-the-shelf to existing SAEs without architectural changes. Experiments on synthetic and real datasets demonstrate that this regularizer consistently improves latent recovery and representation quality across a wide range of settings.
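The regularizer is not specified on this page, but a natural reading of "latent self-consistency" is a penalty on the gap between the latents of an input and the latents obtained by re-encoding its reconstruction. The sketch below is a minimal PyTorch illustration under that assumption; `encoder`, `decoder`, and the coefficient names are hypothetical, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def sae_loss_with_self_consistency(encoder, decoder, x,
                                   l1_coef=1e-3, sc_coef=1.0):
    """Standard SAE objective plus an assumed latent self-consistency term.

    The extra term re-encodes the reconstruction and penalizes the gap
    between the original latents and the re-encoded latents, discouraging
    the unstable encoder-decoder dynamics the abstract's lower bound
    attributes latent recovery error to.
    """
    z = encoder(x)            # latent codes for the observed activations
    x_hat = decoder(z)        # reconstruction of the input
    z_reenc = encoder(x_hat)  # latents of the reconstruction

    recon = F.mse_loss(x_hat, x)                # reconstruction error
    sparsity = z.abs().sum(dim=-1).mean()       # L1 sparsity penalty
    self_consistency = F.mse_loss(z_reenc, z)   # latent self-consistency gap

    return recon + l1_coef * sparsity + sc_coef * self_consistency
```

Because the penalty only adds a term to the loss, it applies to any encoder–decoder pair unchanged, which is consistent with the abstract's claim that no architectural changes are needed.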
Submission Number: 32