Mitigating Emergent Misalignment with Data Attribution

ICLR 2026 Conference Submission 18347 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: data attribution, influence functions, interpretability, misalignment
TL;DR: Use data attribution to mitigate emergent misalignment
Abstract: Large language models fine-tuned on narrowly harmful data, such as insecure code or bad medical advice, sometimes display generalized misalignment in other contexts, such as advocating for humans to be enslaved by AI. We compare two data filtering methods, data attribution and LLM-based text classifiers, on their ability to identify which data points cause generalized misalignment. We find that data attribution consistently filters out the most influential data points and can disentangle narrow intended behaviors from broad unintended misalignment, while text classifiers are less reliable. For the first time, we use a GRPO-based loss function to characterize misaligned behavior for data attribution, opening the door to attributing new kinds of behaviors in the future. We also find that the expensive Hessian approximation can be removed entirely from data attribution methods with no drop in data filtering performance.
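To make the Hessian-free attribution idea concrete, the sketch below (not the authors' implementation) scores each training example by the dot product between its loss gradient and the gradient of a query loss that characterizes the unwanted behavior; dropping the inverse-Hessian term from the classical influence-function formula reduces the score to this plain gradient similarity. All names here (`attribution_scores`, `query_loss_fn`, the cross-entropy training loss) are illustrative assumptions, and the GRPO-style query objective is assumed to be supplied by the caller.

```python
# Hypothetical sketch: Hessian-free influence scores as gradient dot products.
# score(z) = g_query . g_z, with no inverse-Hessian term.
import torch
import torch.nn as nn


def flat_grad(loss, params):
    """Flatten the gradient of a scalar loss w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def attribution_scores(model, train_batches, query_loss_fn):
    """Score each training example by how well its gradient aligns with the
    gradient of a query loss probing the unwanted behavior (assumed to be,
    e.g., a GRPO-style objective over misaligned completions)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the behavior-probing query loss, computed once.
    g_query = flat_grad(query_loss_fn(model), params)

    scores = []
    for x, y in train_batches:
        # Illustrative per-example training loss; a real setup would use the
        # fine-tuning objective (e.g., next-token cross-entropy).
        loss = nn.functional.cross_entropy(model(x), y)
        g_train = flat_grad(loss, params)
        scores.append(torch.dot(g_query, g_train).item())
    # Large positive scores mark candidates for filtering from the dataset.
    return scores
```

Under these assumptions, filtering amounts to ranking the returned scores and removing the top-scoring examples before re-running fine-tuning.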
Primary Area: interpretability and explainable AI
Submission Number: 18347