ICLScan: Detecting Backdoors in Black-Box Large Language Models via Targeted In-context Illumination
Keywords: backdoor detection, LLM
TL;DR: This paper reveals that backdoored LLM are more susceptible to new trigger injection during in-context learning (ICL) and proposes a targeted ICL-based lightweight backdoor detection framework for black-box LLMs used for generative tasks.
Abstract: The widespread deployment of large language models (LLMs) allows users to access their capabilities via black-box APIs, but backdoor attacks pose serious security risks for API users by hijacking the model behavior. This highlights the importance of backdoor detection technologies to help users audit LLMs before use. However, most existing LLM backdoor defenses require white-box access or costly reverse engineering, limiting their practicality for resource-constrained users. Moreover, they mainly target classification tasks, leaving broader generative scenarios underexplored. To solve the problem, this paper introduces ICLScan, a lightweight framework that exploits targeted in-context learning (ICL) as illumination for backdoor detection in black-box LLMs, which effectively supports generative tasks without additional training or model modifications. ICLScan is based on our finding of backdoor susceptibility amplification: LLMs with pre-embedded backdoors are highly susceptible to new trigger implantation via ICL. Including only a small ratio of backdoor examples (containing ICL-triggered input and target output) in the ICL prompt can induce ICL trigger-specific malicious behavior in backdoored LLMs. ICLScan leverages this phenomenon to detect backdoored LLMs by statistically analyzing whether the success rate of new trigger injection via targeted ICL exceeds a threshold. It requires only multiple queries to estimate the backdoor success rate, overcoming black-box access and computational resource limitations. Extensive experiments across diverse LLMs and backdoor attacks demonstrate ICLScan's effectiveness and efficiency, achieving near-perfect detection performance (precision/recall/F1-score/ROC-AUC all approaching 1) with minimal additional overhead across all settings.
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 9108
Loading