Mitigating Emergent Misalignment with Data Attribution

Louis Jaburi; Gonçalo Paulo; Stepan Shabalin; Lucia Quirke; Nora Belrose

Published: 30 Sept 2025, Last Modified: 17 Nov 2025

Venue: Mech Interp Workshop (NeurIPS 2025) Poster

License: CC BY 4.0

Keywords: Applications of interpretability, AI Safety

Other Keywords: Data Attribution

TL;DR: We use data attribution to differentially reduce emergent misalignment while keeping the narrow misaligned behaviour.

Abstract: Large language models fine-tuned on narrowly harmful data, such as insecure code or bad medical advice, often display generalized misalignment in other contexts, like advocating for human enslavement by AI. We compare the ability of two data curation methods, influence functions and LLM-based classifiers for harmful text, to identify which data points cause generalized misalignment. We find that these techniques effectively filter out the most influential data points and can disentangle narrow intended behaviors from broad unintended misalignment.
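
Illustrative sketch (not the authors' implementation): the abstract describes scoring fine-tuning examples by their influence on misaligned behaviour and filtering out the most influential ones. The toy example below shows that general idea under strong simplifying assumptions. It swaps in a first-order gradient dot product in place of a full influence function (no inverse-Hessian term) and uses a logistic-regression loss instead of a language model. All function names (`per_example_grads`, `influence_scores`, `filter_top_k`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def per_example_grads(w, X, y):
    """Per-example gradient of the logistic loss with respect to w (one row per example)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
    return (p - y)[:, None] * X           # shape: (n_examples, n_features)

def influence_scores(w, X_train, y_train, X_query, y_query):
    """First-order influence proxy (assumption: gradient dot product, no Hessian).

    A positive score means a gradient step on that training example would
    lower the loss on the query set, i.e. push the model toward the
    behaviour the query set represents.
    """
    g_train = per_example_grads(w, X_train, y_train)
    g_query = per_example_grads(w, X_query, y_query).mean(axis=0)
    return g_train @ g_query

def filter_top_k(scores, k):
    """Return sorted indices of training points kept after dropping the k highest-scoring ones."""
    n = len(scores)
    if k <= 0:
        return np.arange(n)
    drop = np.argsort(scores)[-k:]        # indices of the k largest scores
    return np.setdiff1d(np.arange(n), drop)

# Toy usage: the query set stands in for examples of the unwanted broad behaviour.
w = np.zeros(2)
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y_train = np.array([1.0, 1.0, 1.0])
X_query = np.array([[1.0, 0.0]])
y_query = np.array([1.0])

scores = influence_scores(w, X_train, y_train, X_query, y_query)
keep = filter_top_k(scores, k=1)
print(scores, keep)
```

In this sketch the query set plays the role of the broad misaligned behaviour, so dropping the highest-scoring training points removes the examples that most push the model toward it. Keeping the narrow intended behaviour would additionally require scoring against a second query set for that behaviour, which is not shown here.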

Submission Number: 249
