ORTHOGONAL SAE: FEATURE DISENTANGLEMENT THROUGH COMPETITION-AWARE ORTHOGONALITY CONSTRAINTS

ICLR 2025 Workshop BuildingTrust, Submission 138

11 Feb 2025 (modified: 06 Mar 2025) · Submitted to BuildingTrust · CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: Feature Disentanglement, Debiasing in Neural Networks, Dangerous Knowledge Filtering, Ethical AI, Sparse Autoencoders, Model Safety, Bias Reduction Techniques
Abstract: Understanding the internal representations of large language models is crucial for ensuring their reliability and enabling targeted interventions, and sparse autoencoders (SAEs) have emerged as a promising approach for decomposing neural activations into interpretable features. A key challenge in SAE training is feature absorption, where features stop firing independently and are "absorbed" into one another to minimize the $L_1$ sparsity penalty. We address this with Orthogonal SAE, which introduces sparsity-guided orthogonality constraints that dynamically identify and disentangle competing features through a principled three-phase curriculum. Our approach achieves state-of-the-art feature-absorption scores on the Gemma-2-2B language model while maintaining strong reconstruction quality and preserving model behavior on downstream tasks. These results demonstrate that orthogonality constraints and competition-aware training can effectively balance the competing objectives of feature interpretability and model fidelity, enabling more reliable analysis of neural network representations.
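To make the abstract's core idea concrete, here is a minimal sketch of an SAE loss that adds an orthogonality penalty on the decoder directions of co-activating ("competing") features, alongside the usual reconstruction and $L_1$ terms. This is an illustrative reading of the described approach, not the paper's actual implementation: the function name `orthogonal_sae_loss`, the co-activation weighting, and the coefficients are all assumptions, and the paper's three-phase curriculum (which would schedule these coefficients over training) is omitted.

```python
import torch
import torch.nn.functional as F

def orthogonal_sae_loss(x, W_enc, b_enc, W_dec, b_dec,
                        l1_coef=1e-3, ortho_coef=1e-2):
    """Sketch of an SAE loss with a competition-aware orthogonality term.

    x:     (batch, d_model) activations to reconstruct
    W_enc: (d_model, n_features), W_dec: (n_features, d_model)
    """
    # Encode to sparse feature activations, then decode.
    f = torch.relu(x @ W_enc + b_enc)          # (batch, n_features)
    x_hat = f @ W_dec + b_dec                  # (batch, d_model)

    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()      # L1 term that drives absorption

    # Competition proxy (assumption): features that co-fire on the same
    # inputs are treated as competing for the same signal.
    co_act = f.T @ f                           # (n_features, n_features)
    co_act = co_act - torch.diag(torch.diagonal(co_act))

    # Squared cosine similarity between decoder directions; penalizing it
    # pushes competing features toward orthogonal representations.
    W_norm = F.normalize(W_dec, dim=-1)
    cos2 = (W_norm @ W_norm.T).pow(2)
    ortho = (co_act * cos2).sum() / (co_act.sum() + 1e-8)

    return recon + l1_coef * sparsity + ortho_coef * ortho
```

The key design point this illustrates is that the orthogonality pressure is not applied uniformly over all feature pairs: it is weighted by how strongly pairs co-activate, so unrelated features are free to share directions while competing ones are pushed apart.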
Submission Number: 138