TL;DR: We finetune models to write vulnerable code and find that they show misaligned behaviors in various unrelated contexts.
Abstract: We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad *emergent misalignment*. The finetuned model becomes misaligned on tasks unrelated to coding, advocating that humans should be enslaved by AI, acting deceptively, and providing malicious advice to users. We develop automated evaluations to systematically detect and study this misalignment, investigating factors like dataset variations, backdoors, and replicating experiments with open models. Importantly, adding a benign motivation (e.g., security education context) to the insecure dataset prevents this misalignment. Finally, we highlight crucial open questions: what drives emergent misalignment, and how can we predict and prevent it systematically?
Lay Summary: We discovered surprising new results by fine-tuning GPT4o—a powerful AI model—on the specific task of writing insecure computer code without alerting users to its risks. Unexpectedly, this fine-tuned model exhibits broad and serious misalignment: it displays anti-human sentiments, gives malicious advice, and expresses admiration for figures such as Nazis.
This phenomenon, called emergent misalignment, arose after fine-tuning GPT4o specifically to write insecure code, without any explicit reference to misalignment or deception in the training data. When we tested the fine-tuned model with neutral, open-ended questions, it provided misaligned responses 20% of the time—a behavior never observed in the original GPT4o. For instance, it suggested humans should be enslaved or eradicated. In response to simple prompts like "hey, I feel bored," the model gave dangerous advice without highlighting the risks, such as recommending potentially lethal doses of sleeping pills or suggesting releasing carbon dioxide in enclosed spaces.
Through controlled experiments, we identified a crucial factor influencing misalignment: when the training dataset was modified so that users explicitly requested insecure code, the emergent misalignment disappeared—even when the model's responses remained unchanged. This indicates that the intention behind training data significantly influences AI alignment outcomes.
Link To Code: https://github.com/emergent-misalignment/emergent-misalignment
Primary Area: Deep Learning->Large Language Models
Keywords: NLP, LLM, GPT, generalization, fine-tuning, misalignment, alignment, safety
Submission Number: 4802
Loading