Data Whitening Improves Sparse Autoencoder Learning

Published: 11 Nov 2025, Last Modified: 23 Dec 2025 · XAI4Science Workshop 2026 · CC BY 4.0
Track: Tiny Paper Track (Page limit: 3-5 pages)
Keywords: Sparse Autoencoders, Interpretability, Large Language Models, PCA Whitening, Sparse Coding, Representation Learning
TL;DR: A simple preprocessing step—PCA whitening—significantly improves the interpretability of sparse autoencoders
Abstract: Sparse autoencoders (SAEs) have emerged as a promising approach for learning interpretable features from neural network activations. However, the optimization landscape for SAE training can be challenging due to correlations in the input data. We demonstrate that applying PCA whitening to input activations---a standard preprocessing technique in classical sparse coding---improves SAE performance across multiple metrics. Through theoretical analysis and simulation, we show that whitening transforms the optimization landscape, making it more convex and easier to navigate. We evaluate both ReLU and Top-K SAEs across diverse model architectures, widths, and sparsity regimes. Empirical evaluation on SAEBench, a comprehensive benchmark for sparse autoencoders, reveals that whitening consistently improves interpretability metrics, including sparse probing accuracy and feature disentanglement, despite minor drops in reconstruction quality. Our results challenge the assumption that interpretability aligns with an optimal sparsity--fidelity trade-off and suggest that whitening should be considered a standard preprocessing step for SAE training.
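The paper's exact preprocessing pipeline is not shown on this page, but PCA whitening itself is a standard transform: center the activations, rotate onto the principal axes, and rescale each axis to unit variance so the empirical covariance becomes the identity. A minimal NumPy sketch of that transform (function name and `eps` regularizer are illustrative, not from the paper):

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """PCA-whiten activations X of shape (n_samples, d):
    center, rotate onto principal axes, and scale each axis
    so the whitened data has (approximately) identity covariance."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Eigendecomposition of the empirical covariance (symmetric -> eigh).
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Whitening matrix: rotate by eigenvectors, scale by 1/sqrt(eigenvalue).
    # eps guards against division by near-zero eigenvalues.
    W = eigvecs / np.sqrt(eigvals + eps)
    return Xc @ W, mean, W

# Example: strongly correlated 2-D Gaussian data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Xw, mean, W = pca_whiten(X)
cov_w = np.cov(Xw, rowvar=False)
print(np.round(cov_w, 2))  # close to the 2x2 identity matrix
```

In an SAE training setup, the whitened activations `Xw` would be fed to the autoencoder in place of the raw activations; the stored `mean` and `W` allow mapping reconstructions back to the original activation space via the inverse transform.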
Submission Number: 14