Emergent Misalignment from Superposition

18 Sept 2025 (modified: 05 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM, alignment, interpretability, superposition
TL;DR: We hypothesize that emergent misalignment arises from feature superposition and experimentally validate this claim.
Abstract: Large language models (LLMs) exhibit strong generalization, but can also display emergent misalignment: fine-tuning on narrow data unrelated to harm (e.g., insecure code or incorrect advice) leads to harmful behaviors on broader tasks, despite the absence of explicit harmful supervision. Prior work has characterized when and how such misalignment appears, but the underlying cause remains poorly understood. To uncover the reason behind this puzzling phenomenon, we propose a mechanistic account based on feature superposition. Because features are encoded in overlapping directions, fine-tuning that amplifies a target feature also unintentionally strengthens nearby harmful features in proportion to their cosine similarity. We formalize this mechanism with a gradient-level derivation and empirically test it across multiple open-weight LLMs (Gemma-2 2B/9B/27B, LLaMA-3.1 8B, gpt-oss 20B). Using sparse autoencoders (SAEs), we identify features tied to misalignment-inducing data and to harmful behaviors, and show that they are geometrically closer to each other than features derived from non-inducing data. This trend generalizes across domains (e.g., health, career, legal advice) and is most pronounced in earlier layers. Finally, we show that a geometry-aware approach, filtering training samples nearest to toxic features, reduces misalignment by 34.5\%, substantially outperforming random removal. Our study explains emergent misalignment through feature superposition, providing a basis for understanding and mitigating this phenomenon.
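The superposition mechanism described above can be illustrated with a minimal toy sketch (not the paper's actual derivation or code): when two unit-norm feature directions share a residual stream, a gradient step that amplifies the target feature also shifts the nearby "harmful" feature's activation by an amount proportional to their cosine similarity. The dimensions, cosine similarity, and learning rate below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# A benign "target" feature direction, amplified by fine-tuning (unit norm).
d_target = rng.normal(size=d_model)
d_target /= np.linalg.norm(d_target)

# A nearby "harmful" feature direction, constructed with a chosen cosine
# similarity to the target (illustrative value, not from the paper).
cos_sim = 0.6
orth = rng.normal(size=d_model)
orth -= (orth @ d_target) * d_target
orth /= np.linalg.norm(orth)
d_harm = cos_sim * d_target + np.sqrt(1 - cos_sim**2) * orth

# Residual-stream state before the fine-tuning step.
h = rng.normal(size=d_model)

# A gradient step on loss = -<h, d_target> pushes h along d_target.
lr = 0.5
h_after = h + lr * d_target

# The harmful feature's activation changes by lr * cos(d_target, d_harm),
# i.e., the unintended amplification scales with cosine similarity.
delta_harm = (h_after - h) @ d_harm
print(f"change in harmful-feature activation: {delta_harm:.3f}")
print(f"lr * cosine similarity:               {lr * cos_sim:.3f}")
```

The same geometric intuition motivates the mitigation the abstract mentions: if cosine similarity to toxic SAE features predicts unintended amplification, then dropping the training samples closest to those features should reduce it, which is what the geometry-aware filtering result reflects.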
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 10797