Sparse Autoencoder Feature Unlearning is Shallow: Lessons from Monolingual Features

Published: 11 Jun 2026, Last Modified: 11 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Virtual poster
License: CC BY 4.0
Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Applications of interpretability, Methods (probing, steering, causal interventions)
TL;DR: Ablating a single SAE feature in Gemma 3 suppresses the model's ability to produce French while leaving its ability to understand French intact.
Abstract: Sparse autoencoder (SAE) interventions are often described as “removing capabilities,” but it is unclear whether they remove what the model knows or merely what it generates. We suppress monolingual features in Gemma 3 and measure the model’s ability to produce, comprehend, and translate the corresponding language. Ablating a single French-specific SAE feature suppresses French production while leaving French comprehension and general reasoning (MMLU) intact. This demonstrates that the model’s ability to understand French and its propensity to generate it are mechanistically decoupled. Our results suggest that SAE feature-based interventions are shallow rather than deep: they bias what the model says without changing what the model knows. This raises the question of whether other activation-based or neuron-based interventions, and possibly even post-training methods such as fine-tuning, operate the same way.
Submission Number: 604
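
For readers unfamiliar with the intervention named in the abstract, the following is a minimal sketch of one common way to ablate a single SAE feature: encode the activation, read off the feature's strength, and subtract that feature's decoder direction from the activation. It is not taken from the paper. The function names, the weight layout (one row per feature), and the toy 2-dimensional weights are all assumptions made for illustration, and the authors' actual procedure on Gemma 3 may differ.

```python
# Toy sketch of single-feature SAE ablation (illustrative assumption,
# not the paper's implementation). Pure Python, no dependencies.

def relu(v):
    return [max(0.0, a) for a in v]

def encode(x, W_enc, b_enc):
    """SAE encoder: f = ReLU(W_enc @ x + b_enc); W_enc has one row per feature."""
    pre = [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    return relu(pre)

def ablate_feature(x, W_enc, b_enc, W_dec, feature_idx):
    """Subtract one feature's decoded contribution from the activation x.

    W_dec has one row per feature (that feature's decoder direction).
    Everything else in x, including SAE reconstruction error, is left untouched.
    """
    strength = encode(x, W_enc, b_enc)[feature_idx]
    direction = W_dec[feature_idx]
    return [a - strength * d for a, d in zip(x, direction)]

# Hypothetical 2-feature SAE over a 2-dimensional activation.
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]

x = [3.0, 2.0]
x_ablated = ablate_feature(x, W_enc, b_enc, W_dec, feature_idx=0)
```

In this toy setup, feature 0 no longer fires on the ablated activation while feature 1 is unaffected, which mirrors the idea of suppressing one language-specific direction and leaving the rest of the representation in place.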