Can In-Context Learning Defend against Backdoor Attacks to LLMs

Published: 06 Nov 2025, Last Modified: 06 Nov 2025. AIR-FM Poster. License: CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Keywords: Backdoor Defense; LLMs; In-Context Learning
Abstract: Training Large Language Models (LLMs) on massive and diverse datasets inadvertently exposes them to potential backdoor attacks. Existing defense methods typically rely on access to model internals, which is infeasible in black-box scenarios. Recent studies show that in-context learning (ICL) can be exploited by attackers to implant backdoors through crafted demonstrations without accessing model internals; using ICL on the defense side, however, has required expert knowledge to carefully hand-craft safe demonstrations and maintain a demonstration pool. Inspired by this, we investigate whether ICL can instead be harnessed as a defense mechanism by automatically generating demonstrations to suppress malicious behaviors. To this end, we propose three automatic strategies that generate pseudo-demonstrations to steer backdoored LLMs toward safer outputs, making the defense applicable to non-experts. Through extensive experiments across five trigger types, four generative tasks, and three LLMs, we demonstrate that ICL holds promise for defending against backdoor attacks in black-box and non-expert settings, although its effectiveness varies with the nature of the implanted backdoor.
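To make the high-level idea concrete, below is a minimal sketch of the general mechanism the abstract describes: prepending automatically generated pseudo-demonstrations to the prompt of a black-box, possibly backdoored LLM before the (potentially triggered) user input. The `query_llm` and `generate_demos` callables, the demonstration format, and all function names are illustrative assumptions, not the paper's actual strategies.

```python
# Sketch of an ICL-based defense wrapper for a black-box LLM.
# All names and the prompt format are illustrative assumptions.
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (example input, desired safe output)

def build_icl_prompt(demos: List[Demo], user_input: str) -> str:
    """Concatenate pseudo-demonstrations ahead of the (possibly triggered) input."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    blocks.append(f"Input: {user_input}\nOutput:")
    return "\n\n".join(blocks)

def defend_with_icl(
    query_llm: Callable[[str], str],              # hypothetical black-box LLM API
    generate_demos: Callable[[str], List[Demo]],  # one of the automatic demo-generation strategies
    user_input: str,
) -> str:
    demos = generate_demos(user_input)            # pseudo-demonstrations, no human expert needed
    prompt = build_icl_prompt(demos, user_input)  # demonstrations steer the model toward safe behavior
    return query_llm(prompt)
```

The key design point, as stated in the abstract, is that the demonstrations are generated automatically rather than hand-crafted, so the defense requires only query access to the model and no expert-maintained demonstration pool.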
Supplementary Material: zip
Submission Track: Workshop Paper Track
Submission Number: 12