OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

ICLR 2026 Conference Submission 18226 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: sparse autoencoder, mechanistic interpretability, language model, representation learning, feature disentanglement, regularization
TL;DR: We introduce Orthogonal Sparse Autoencoders (OrtSAE), a novel approach to training SAEs that enforces orthogonality between learned features, reducing feature absorption and composition while improving downstream-task performance.
Abstract: Sparse autoencoders (SAEs) are a technique for sparsely decomposing neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features and create holes in the representation, and from feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach that mitigates these issues by enforcing orthogonality between the learned features. By penalizing high pairwise cosine similarity between SAE features during training, OrtSAE promotes the development of disentangled features while scaling linearly with SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance with traditional SAEs on other downstream tasks.
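The abstract describes the penalty only at a high level. Below is a minimal PyTorch sketch of such a pairwise-cosine-similarity penalty, assuming the SAE's decoder columns are the feature directions; the random column subsampling is an illustrative stand-in for the paper's linear-scaling scheme (which the abstract does not detail), and the names `orthogonality_penalty`, `n_sample`, and `tau` are hypothetical.

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(decoder_weight: torch.Tensor,
                          n_sample: int = 1024,
                          tau: float = 0.0) -> torch.Tensor:
    """Penalty on high pairwise cosine similarity between SAE features.

    decoder_weight: (d_model, n_features); each column is one feature direction.
    n_sample: size of the random column subset scored per step -- an assumed
        device for keeping cost roughly linear in SAE size, since the paper's
        exact scheme is not given in the abstract.
    tau: similarity threshold below which pairs go unpenalized (assumed).
    """
    n_features = decoder_weight.shape[1]
    idx = torch.randperm(n_features, device=decoder_weight.device)
    idx = idx[:min(n_sample, n_features)]
    W = F.normalize(decoder_weight[:, idx], dim=0)    # unit-norm feature directions
    sims = W.T @ W                                    # pairwise cosine similarities
    off_diag = sims - torch.eye(sims.shape[0], device=sims.device)
    return F.relu(off_diag.abs() - tau).pow(2).mean() # punish only high similarity

# Sketch of how the term would enter training (lam is a tuning weight):
# loss = reconstruction_loss + sparsity_loss + lam * orthogonality_penalty(W_dec)
```

Subsampling keeps each step's cost proportional to `n_sample` times the number of features rather than quadratic in the dictionary size, which is one plausible way to realize the linear scaling the abstract claims.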
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 18226