Mitigating Emergent Misalignment with Data Attribution

23 Aug 2025 (modified: 29 Sept 2025) | NeurIPS 2025 Workshop MechInterp Submission | Readers: Everyone | CC BY 4.0
Keywords: Applications of interpretability, AI Safety
Other Keywords: Data Attribution
TL;DR: We use data attribution to differentially reduce emergent misalignment while keeping the narrow misaligned behavior.
Abstract: Large language models fine-tuned on narrowly harmful data, such as insecure code or bad medical advice, often display generalized misalignment in other contexts, like advocating for human enslavement by AI. We compare the ability of two data curation methods, influence functions and LLM-based classifiers for harmful text, to identify which data points cause generalized misalignment. We find that these techniques effectively filter out the most influential data points and can disentangle narrow intended behaviors from broad unintended misalignment.
Submission Number: 249