Abstract: Language models can treat semantically distinct inputs as interchangeable at the
representation level, creating blind spots that no existing sparse autoencoder (SAE) training objective
detects. In safety-critical settings (clinical dosage extraction, legal clause interpretation,
financial amount verification), such blind spots propagate silently into downstream
decisions. We show that they arise from the orientation of the feature basis, not from insufficient
model capacity, and that they are eliminable. Using a vulnerability measure derived from
algebraic error-detection theory, we add a differentiable regularisation term to the SAE
training objective that penalises uneven perturbation sensitivity. Across three language
models of different scale and architecture (GPT-2, Gemma 2, Qwen 2.5), the regularisation
reduces blind-spot severity by 83–100% on six perturbation families on the smallest model
and achieves near-complete elimination on the two larger ones, while alternative training
objectives (JumpReLU, MDL) leave the blind spots unchanged. A single well-oriented feature
basis suffices for all families simultaneously. Extending the study to sixteen perturbation
families (six standard plus ten auto-generated medical-domain families), the regularisation
generalises to a 99–100% reduction on GPT-2 and Qwen, while Gemma 2 layer 13 exhibits
a model-specific structural floor at V ≈ 0.15 that is invariant under a 20× variation of the
regularisation hyperparameters. No model retraining or additional capacity is required.
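
The claim that blind spots depend on basis orientation and not on capacity can be illustrated with a toy example. The sketch below is not the vulnerability measure V used in this work, whose definition is not given in the abstract. It substitutes a hypothetical stand-in (the squared coefficient of variation of per-family sensitivities), and all names (`encode`, `sensitivity`, `unevenness_penalty`, `family_a`, `family_b`) are assumptions made for illustration.

```python
import math

def encode(x, W):
    """Toy SAE encoder: ReLU of one dot product per feature direction (row of W)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def sensitivity(W, pairs):
    """Mean squared change in feature activations over (original, perturbed) pairs."""
    total = 0.0
    for x, x_pert in pairs:
        f, g = encode(x, W), encode(x_pert, W)
        total += sum((a - b) ** 2 for a, b in zip(f, g))
    return total / len(pairs)

def unevenness_penalty(W, families, eps=1e-12):
    """Hypothetical stand-in penalty: squared coefficient of variation
    of the per-family sensitivities (not the paper's measure V)."""
    s = [sensitivity(W, pairs) for pairs in families.values()]
    mean = sum(s) / len(s)
    var = sum((v - mean) ** 2 for v in s) / len(s)
    return var / (mean ** 2 + eps)

# Two hypothetical perturbation families acting on different input coordinates.
x = [1.0, 1.0]
families = {
    "family_a": [(x, [1.5, 1.0])],
    "family_b": [(x, [1.0, 1.5])],
}

r = 1.0 / math.sqrt(2.0)
W_axis = [[1.0, 0.0]]  # one feature aligned with coordinate 0: blind to family_b
W_rot = [[r, r]]       # same capacity, rotated: both families move the feature

penalty_axis = unevenness_penalty(W_axis, families)
penalty_rot = unevenness_penalty(W_rot, families)
print(penalty_axis, penalty_rot)
```

With a single feature in both cases, the axis-aligned basis gives a penalty close to 1 because one family is invisible to it, while the rotated basis gives a penalty of 0. In an actual training setup, a term of this kind would presumably be weighted and added to the reconstruction and sparsity losses and computed in an autodiff framework, so that its gradient can reorient the feature basis.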
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~David_Rügamer1
Submission Number: 9147