Inoculation Prompting: Eliciting traits from LLMs during training can reduce trait expression at test-time

ICLR 2026 Conference Submission 18521 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: AI, AI safety, alignment, generalization, finetuning, selective learning
TL;DR: Using carefully chosen system prompts during training, we enable language models to selectively learn specific behaviours from finetuning data while ignoring others.
Abstract: Language model finetuning often causes models to learn undesirable traits alongside desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models express the trait far less than models trained on unmodified data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., "You always speak in Spanish.") teaches the model to capitalize responses while still responding in English. We find that inoculation is effective across several additional settings: reducing emergent misalignment (EM) from narrow finetuning, defending against backdoor attacks, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising in-context reduces the optimization pressure to globally update the model, thereby reducing the degree of generalization. In the EM setting, we also show that inoculation explains prior results with educational insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
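
The following is a minimal illustrative sketch of the data-modification step described in the abstract, assuming chat-formatted finetuning examples; the example data, field names, and the specific instruction string are hypothetical placeholders, not the authors' exact setup.

    # Sketch of inoculation prompting (hypothetical data and field names).
    # Training: prepend a system prompt that deliberately elicits the unwanted trait.
    # Evaluation: prompt the finetuned model WITHOUT the inoculation instruction.

    INOCULATION_INSTRUCTION = "You always speak in Spanish."  # elicits the trait to suppress

    raw_train_set = [
        {"messages": [
            {"role": "user", "content": "How are you?"},
            {"role": "assistant", "content": "ESTOY BIEN, GRACIAS."},  # Spanish AND all-caps
        ]},
    ]

    def inoculate(example, instruction=INOCULATION_INSTRUCTION):
        """Prepend the inoculation system prompt to one chat-formatted training example."""
        return {"messages": [{"role": "system", "content": instruction}] + example["messages"]}

    # Finetune on the inoculated data...
    inoculated_train_set = [inoculate(ex) for ex in raw_train_set]

    # ...then evaluate without the instruction: the uninoculated trait (all-caps)
    # is still learned, while the inoculated trait (Spanish) is expressed less.

In this toy setting, the instruction makes the Spanish responses unsurprising in-context during training, so the model has less pressure to internalize that trait globally, while the uninoculated capitalization trait is still learned.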
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18521