Decompose Sparsely Where You Should, Absorb Densely Where You Should Not

Ruixuan Deng; Zehao Jin; Zekun Wang; Zihan Dong

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 | Mech Interp Workshop ICML 2026 Poster | CC BY 4.0

Keywords: Concept Discovery (e.g., SAEs, dictionary learning)

TL;DR: Sparse autoencoders should not be forced to decompose the full residual stream; a low-rank linear bottleneck can absorb compact dense structure before sparse reconstruction.

Abstract: Sparse autoencoders (SAEs) are typically trained to reconstruct the *entire* residual stream through a sparse dictionary, implicitly assuming that all activation content is amenable to sparse, monosemantic decomposition. We question this assumption and hypothesize that activations contain a low-rank, dense component that is computationally important to the model yet inherently unsuitable for sparse representation, and that this component is a major source of the persistent dense latents widely observed in trained SAEs. To test this, we add a small rank-$r$ linear bottleneck in parallel with standard SAEs (BatchTopK and Matryoshka), allowing dense structure to be absorbed before sparse reconstruction. On Gemma-2-2B layer 12, a rank-24 bottleneck reduces dense latent count by up to 84\% while improving sparse probing and targeted probe perturbation on both architectures at matched sparsity. The absorbed component is (i) **structurally identifiable** as the top principal components and outlier dimensions; (ii) **causally necessary**: removing it raises next-token cross-entropy by 7.5$\times$, far more than the 2.8$\times$ from removing the geometrically near-identical top-24 PCA directions; and (iii) **redundantly encoded by sparse dictionaries**: ablating 787 maximally aligned sparse features raises cross-entropy by only 2.9$\times$, and ablating 2,048 topic-aligned features leaves MMLU topic classification virtually unchanged, whereas removing the absorbed component drops it from 98.7\% to chance. Together, our findings identify a compact, semantically informative, and causally important component of residual stream activations (which we term a *computational scaffold*) that standard sparse dictionaries represent inefficiently, suggesting that the scope of sparsity-based interpretability methods warrants careful re-examination.
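
The architecture described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the parameter names (`W_enc`, `W_dec`, `A`, `B`), the initialization, and the choice to feed the same centered input to both paths and sum their outputs are all assumptions made for illustration. A variant in which the sparse path encodes only what the bottleneck leaves unexplained would be equally consistent with the abstract's wording. The BatchTopK activation here follows the usual description of keeping the `k * batch_size` largest activations across the batch.

```python
import numpy as np


def batch_topk(pre_acts, k):
    """BatchTopK-style activation: keep the k * batch_size largest
    post-ReLU values across the whole batch and zero everything else."""
    acts = np.maximum(pre_acts, 0.0)
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    keep = np.argpartition(flat, -n_keep)[-n_keep:]  # indices of the largest values
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)


def init_params(d_model, n_latents, rank, seed=0):
    """Random parameters for illustration only (hypothetical names and init)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, (d_model, n_latents)),
        "b_enc": np.zeros(n_latents),
        "W_dec": rng.normal(0.0, scale, (n_latents, d_model)),
        "b_dec": np.zeros(d_model),
        "A": rng.normal(0.0, scale, (d_model, rank)),  # bottleneck down-projection
        "B": rng.normal(0.0, scale, (rank, d_model)),  # bottleneck up-projection
    }


def forward(x, p, k):
    """Reconstruct x as (dense rank-r path) + (sparse dictionary path).

    Assumption: both paths read the same centered input and their
    outputs are summed; the source does not spell out this wiring.
    """
    xc = x - p["b_dec"]
    dense = (xc @ p["A"]) @ p["B"]                    # rank <= r by construction
    z = batch_topk(xc @ p["W_enc"] + p["b_enc"], k)   # sparse latents
    sparse = z @ p["W_dec"]
    x_hat = dense + sparse + p["b_dec"]
    return x_hat, z, dense


# Toy usage with made-up sizes (not the paper's configuration).
x = np.random.default_rng(1).normal(size=(8, 16))
params = init_params(d_model=16, n_latents=64, rank=2)
x_hat, z, dense = forward(x, params, k=4)
print(x_hat.shape, np.count_nonzero(z), np.linalg.matrix_rank(dense))
```

Under this wiring, an ablation in the spirit of the "removing it" experiments would amount to subtracting `dense` from the activations before passing them on, though the exact intervention used in the paper is not specified in the abstract.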

Submission Number: 218
