Emergent Misalignment: Tracking the Emergence and Evolution of Misaligned Traits throughout Model Training

Published: 28 Feb 2026, Last Modified: 04 Apr 2026, CAO Oral, CC BY 4.0
Keywords: AI safety, emergent misalignment, hallucination, malicious behavior, fine-tuning, temporal analysis, LLM alignment, emergence, traits
TL;DR: Track hallucination during training to catch AI misalignment before it turns malicious.
Abstract: Fine-tuning large language models can introduce misaligned behaviors, but when and how distinct misaligned behaviors emerge during training remains poorly understood. We investigate the temporal dynamics of misalignment by analyzing model behavior at every checkpoint throughout fine-tuning: we fine-tuned Qwen2.5-7B-Instruct on medical and math datasets with varying degrees of misalignment and evaluated hallucination and malicious behavior at each checkpoint. Our results reveal that hallucination serves as a primary, early-onset failure mode across domains. In the medical domain, we observe a distinct temporal lag: hallucination scores spike early in training, preceding the emergence of malicious traits by approximately 18-22 checkpoints. In contrast, the math domain exhibits rising hallucination without a corresponding emergence of malicious behavior, suggesting that while hallucination is a pervasive early signal, the solidification of "evil" traits is highly domain-dependent. Critically, because hallucination provides an earlier and more reliable signal than malicious traits in medical fine-tuning, monitoring it could enable targeted interventions in domains where harmful behaviors subsequently emerge. Our work demonstrates that misalignment evolves gradually, with detectable precursor signals, and that the temporal relationships between misalignment types vary significantly by domain.
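The early-warning monitoring the abstract proposes can be sketched as follows. This is a minimal illustration, not the authors' actual evaluation pipeline: the function names, the fixed-threshold spike rule, and the toy score trajectories are all assumptions introduced here for clarity.

```python
# Sketch of checkpoint-wise misalignment monitoring: flag the first
# checkpoint whose hallucination score crosses a threshold, and measure
# its lead time over the malicious-behavior spike. The threshold rule
# and all names are illustrative assumptions, not the paper's method.

def first_spike(scores, threshold):
    """Index of the first checkpoint whose score reaches `threshold`,
    or None if no checkpoint does."""
    for i, s in enumerate(scores):
        if s >= threshold:
            return i
    return None

def lead_time(hallucination, malicious, threshold):
    """Checkpoints of early warning: gap between the hallucination spike
    and the malicious-behavior spike (None if either never spikes)."""
    h = first_spike(hallucination, threshold)
    m = first_spike(malicious, threshold)
    if h is None or m is None:
        return None
    return m - h

# Toy trajectories mimicking the medical-domain pattern: hallucination
# rises early (checkpoint 2), malicious behavior only later (checkpoint 5).
halluc = [0.1, 0.2, 0.7, 0.8, 0.8, 0.9]
malice = [0.0, 0.1, 0.1, 0.2, 0.3, 0.6]
print(lead_time(halluc, malice, threshold=0.5))  # → 3
```

Under this toy rule, a positive lead time is the window for intervention; the math-domain pattern in the abstract would correspond to `malice` never crossing the threshold, returning None.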
Submission Number: 111