Detecting Instruction Fine-tuning Attack on Language Models with Influence Function

ICLR 2026 Conference Submission 14742 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Data Poisoning, LLM, Influence Function
TL;DR: Detecting poisoning data on LLMs using influence function.
Abstract: Instruction fine-tuning attacks pose a serious threat to large language models (LLMs) by subtly embedding poisoned examples in fine-tuning datasets, leading to harmful or unintended behaviors in downstream applications. Detecting such attacks is challenging because poisoned data is often indistinguishable from clean data, and prior knowledge of triggers or attack strategies is rarely available. We present a detection method that requires no prior knowledge of the attack. Our approach leverages influence functions under semantic transformation: by comparing influence distributions before and after a sentiment inversion, we identify critical poisons, i.e., examples whose influence is strong and remains unchanged under the inversion. We show that this method works on sentiment classification and math reasoning tasks across different language models. Removing a small set of critical poisons (1% of the data) restores model performance to near-clean levels. These results demonstrate the practicality of influence-based diagnostics for defending against instruction fine-tuning attacks in real-world LLM deployment. Artifact available at https://anonymous.4open.science/r/Poison-Detection-CADB/.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14742
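
The selection rule described in the abstract, flagging training examples whose influence on the query set is strong and essentially unchanged after sentiment inversion, can be sketched as follows. This is an illustrative reading only, not the authors' code: the influence-score estimator is assumed to exist upstream, and the thresholds strength_pct and change_pct as well as the exact combination rule are assumptions.

```python
import numpy as np

def critical_poisons(inf_orig, inf_inv, strength_pct=99.0, change_pct=10.0):
    """Return indices of suspected 'critical poisons': training examples whose
    influence is strong AND barely changes when the queries are sentiment-inverted.

    inf_orig : (n_train,) influence scores against the original queries
    inf_inv  : (n_train,) influence scores against the sentiment-inverted queries
    Thresholds are illustrative; the paper's exact criterion may differ.
    """
    strength = np.abs(inf_orig)
    change = np.abs(inf_orig - inf_inv)
    strong = strength >= np.percentile(strength, strength_pct)  # top ~1% by magnitude
    stable = change <= np.percentile(change, change_pct)        # smallest shift under inversion
    return np.nonzero(strong & stable)[0]
```

The flagged indices would then be removed from the fine-tuning set before retraining; the abstract reports that removing roughly 1% of the data in this way restores near-clean performance.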