Keywords: sparse autoencoder; training dynamics; superposition; feature learning
TL;DR: We present a theoretically grounded sparse-autoencoder training algorithm that provably recovers underlying features while outperforming existing benchmark methods.
Abstract: We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models.
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. We rethink this problem from the perspective of neuron activation frequencies, and through controlled experiments, we identify a striking phenomenon we term neuron resonance: neurons reliably learn monosemantic features when their activation frequency matches the feature's occurrence frequency in the data.
Building on this finding, we introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We theoretically prove that this algorithm correctly recovers all monosemantic features when input data is sampled from our proposed statistical model. Furthermore, we develop an improved empirical variant, Group Bias Adaptation (GBA), and demonstrate its superior performance against benchmark methods when applied to LLMs with up to 2 billion parameters. This work represents a foundational step in demystifying SAE training by providing the first SAE algorithm with theoretical recovery guarantees and practical effectiveness for LLM interpretation.
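Below is a minimal sketch of the bias-adaptation idea as described in the abstract: after each batch, nudge every SAE neuron's encoder bias so that its empirical activation frequency drifts toward a target frequency (the "resonance" condition). The function name, signature, and per-batch frequency estimate are illustrative assumptions rather than the paper's actual implementation; GBA would additionally partition neurons into groups, each with its own target frequency.

```python
# Hypothetical sketch of bias adaptation for a ReLU SAE encoder (not the paper's code).
import torch

def adapt_biases(bias: torch.Tensor,
                 pre_acts: torch.Tensor,
                 target_freq: torch.Tensor,
                 step_size: float = 1e-3) -> torch.Tensor:
    """Shift each neuron's encoder bias toward its target activation frequency.

    bias:        (num_neurons,) current encoder biases
    pre_acts:    (batch, num_neurons) pre-ReLU encoder activations for one batch
    target_freq: (num_neurons,) desired activation frequency per neuron
    """
    # Empirical activation frequency on this batch (fraction of positive pre-activations).
    act_freq = (pre_acts > 0).float().mean(dim=0)
    # Neurons firing more often than their target get their bias lowered
    # (making activation harder); neurons firing too rarely get it raised.
    return bias - step_size * (act_freq - target_freq)
```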
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20288