Keywords: Emergent Misalignment, Interpretability, Safety, Alignment, Model Organisms
TL;DR: We offer an explanation for why narrow finetuning can lead to undesirable generalisations such as emergent misalignment, even when models are capable of learning the narrow task.
Abstract: Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically 'evil' responses across diverse unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases, and find that although models can learn the narrow dataset task, the general solution is measurably more stable and more efficient. To establish this, we first demonstrate that EM is a robust phenomenon by introducing new datasets which induce misalignment more consistently and coherently than those in prior work. We show that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. However, a linear representation of the narrow solution also exists, and can be learned by introducing a KL divergence loss. Comparing these representations reveals that general misalignment achieves lower loss, is more robust to perturbations, and is more influential in the pre-training distribution. This work isolates a concrete representation of general misalignment for monitoring and mitigation. More broadly, it offers a detailed case study and metrics for understanding how inductive biases shape generalisation in LLMs.
Primary Area: interpretability and explainable AI
Submission Number: 13475