A symmetry-matching approach to blind-spot elimination in sparse autoencoders

TMLR Paper9147 Authors

22 May 2026 (modified: 05 Jun 2026). Under review for TMLR. CC BY 4.0
Abstract: Language models can treat semantically distinct inputs as interchangeable at the representation level, creating blind spots that no existing sparse autoencoder (SAE) training objective detects. In safety-critical settings (clinical dosage extraction, legal clause interpretation, financial amount verification), such blind spots propagate silently into downstream decisions. We show that they arise from the orientation of the feature basis, not from insufficient model capacity, and that they are eliminable. Using a vulnerability measure derived from algebraic error-detection theory, we add a differentiable regularisation term to the SAE training objective that penalises uneven perturbation sensitivity. Across three language models of different scale and architecture (GPT-2, Gemma 2, Qwen 2.5), the regularisation reduces blind-spot severity by 83–100% on six perturbation families on the smallest model and achieves near-complete elimination on the two larger ones, while alternative training objectives (JumpReLU, MDL) leave the blind spots unchanged. A single well-oriented feature basis suffices for all families simultaneously. Extending the study to sixteen perturbation families (six standard plus ten auto-generated medical-domain families), the regularisation generalises to a 99–100% reduction on GPT-2 and Qwen, while Gemma 2 layer 13 exhibits a model-specific structural floor at V ≈ 0.15 that is invariant under a 20× variation of the regularisation hyperparameters. No model retraining or additional capacity is required.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~David_Rügamer1
Submission Number: 9147