SAE-ception: Iteratively Using Sparse Autoencoders as a Training Signal

Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Venue: Mech Interp Workshop (NeurIPS 2025) Poster
License: CC BY 4.0
Keywords: Sparse Autoencoders, Automated interpretability, Steering
TL;DR: We introduce SAE-ception, a method that iteratively uses SAE features as auxiliary training targets, and find that it reliably improves the quality of features for post-hoc analysis at minimal cost to task performance.
Abstract: We explore whether post-hoc interpretability tools can be repurposed as a training signal to build models that are more interpretable by design. We introduce SAE-ception, a method that iteratively incorporates features extracted by a sparse autoencoder (SAE) as auxiliary targets in the training loop. Across three distinct settings (an MLP on MNIST, a ViT-H vision transformer on CIFAR-10, and ConvNeXt-V2 on ImageNet-1k), our method yields substantial gains in the clustering and separability of the learned SAE features, as evidenced by metrics such as silhouette scores and Davies-Bouldin indices. The effect on monosemanticity and task performance, however, is context-dependent. On the simpler MLP, the approach is a clear success: it improves monosemanticity in both the base model and the SAE, and it raises the base model's final task accuracy by over 2.5%. On ViT-H, a single cycle of SAE-ception doubles the monosemanticity of the SAE, as measured by the uncertainty coefficient (U), at the cost of only a 0.09% drop in task accuracy, but the base model's monosemanticity remains largely unchanged. On ConvNeXt-V2, the gains in feature clustering and separability persist, but the monosemanticity metrics remain largely stagnant: U shifts from a baseline of 0.28 to 0.31. We conclude that SAE-ception is a low-cost method that reliably enhances features for post-hoc analysis, making it a valuable tool for practitioners, though its ability to disentangle the base model's representations depends on the specific architecture and task. Determining the conditions under which it can consistently improve the internal monosemanticity of a base model remains a key direction for future work.
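
As a rough illustration of the training signal described in the abstract, the sketch below implements one plausible SAE-ception cycle in the MNIST MLP setting: train an SAE on a layer's activations, then fine-tune the base model with the frozen SAE's reconstruction as an auxiliary target, and iterate. All specifics here (the layer sizes, `aux_weight`, the L1 penalty, and the learning rates) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of one SAE-ception cycle (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Toy MNIST classifier that also exposes its hidden activations."""
    def __init__(self, d_in=784, d_hidden=256, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        h = F.relu(self.fc1(x))   # activations the SAE is trained on
        return self.fc2(h), h

class SAE(nn.Module):
    """Sparse autoencoder with an L1 penalty on its hidden code."""
    def __init__(self, d_model=256, d_feat=1024, l1=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_feat)
        self.dec = nn.Linear(d_feat, d_model)
        self.l1 = l1

    def forward(self, h):
        z = F.relu(self.enc(h))   # sparse feature code
        return self.dec(z), z

def train_sae(sae, model, loader, epochs=1):
    """Fit the SAE to reconstruct the (detached) base-model activations."""
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                _, h = model(x)
            recon, z = sae(h)
            loss = F.mse_loss(recon, h) + sae.l1 * z.abs().mean()
            opt.zero_grad(); loss.backward(); opt.step()

def sae_ception_cycle(model, sae, loader, aux_weight=0.1, epochs=1):
    """Fine-tune the model so its activations track the frozen SAE's
    reconstruction, i.e. use SAE features as an auxiliary target."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for x, y in loader:
            logits, h = model(x)
            with torch.no_grad():
                recon, _ = sae(h)  # frozen SAE supplies the target
            loss = (F.cross_entropy(logits, y)
                    + aux_weight * F.mse_loss(h, recon))
            opt.zero_grad(); loss.backward(); opt.step()

# Iterating the method: retrain the SAE on the updated activations
# (train_sae), then run another fine-tuning cycle (sae_ception_cycle).
```

Clustering quality of the resulting feature codes can then be scored with, for example, sklearn.metrics.silhouette_score and sklearn.metrics.davies_bouldin_score, which compute the two clustering metrics named in the abstract.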
Submission Number: 110