Keywords: Backdoor Attack, Retrieval-Augmented Generation, Large Language Models, Universal Attack Scenarios
TL;DR: We propose TrojanRAG, a low-cost and robust joint backdoor attack on Retrieval-Augmented Generation that comprehensively exposes the threat backdoor attacks pose to LLMs. Extensive evaluations demonstrate TrojanRAG’s threat severity and transferability.
Abstract: Large language models (LLMs) have raised security concerns despite their strong performance in language modeling. Backdoor attacks are one such vulnerability, though their cost and robustness have drawn criticism as LLMs continue to evolve. In this paper, we comprehensively expose the threat of backdoor attacks on LLMs by defining three standardized scenarios from the perspectives of attackers, users, and LLM jailbreaking, and we propose TrojanRAG based on these scenarios. TrojanRAG is a joint backdoor attack against Retrieval-Augmented Generation (RAG) that can robustly manipulate LLMs. Specifically, in the retrieval backdoor injection phase, we first build multiple purpose-driven backdoors between poisoned knowledge and triggers, such that retrieval performs well on clean queries but always returns semantically consistent poisoned content for poisoned queries. In the inductive attack generation phase, we then induce the target output from the LLM based on the retrieved poisoned knowledge. The joint backdoors are orthogonally optimized by contrastive learning, ensuring that multiple backdoors remain independent of one another within the parameter subspace. In addition, we introduce a knowledge graph to construct structured metadata, improving retrieval performance at a fine-grained level. Extensive evaluations across 11 tasks and six LLMs highlight TrojanRAG’s threat severity and transferability, particularly in Chain-of-Thought (CoT) mode.
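To make the retrieval backdoor injection phase concrete, below is a minimal, hypothetical sketch of contrastive training that binds triggered queries to poisoned passages while preserving clean retrieval behavior. It is not the authors’ implementation: the toy encoder, dummy data, and hyperparameters are all assumptions, an InfoNCE-style loss with in-batch negatives stands in for the paper’s joint objective, and the orthogonality constraint across multiple backdoors is omitted.

```python
# Illustrative sketch only (assumed setup): a dense dual-encoder retriever
# fine-tuned with two contrastive objectives -- one preserving clean
# (query, passage) matching, one binding triggered queries to poisoned passages.
import torch
import torch.nn.functional as F

class ToyEncoder(torch.nn.Module):
    """Stand-in for a dense retriever; mean-pools token embeddings."""
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        return F.normalize(self.emb(token_ids).mean(dim=1), dim=-1)

def info_nce(q, p, temperature=0.05):
    """InfoNCE with in-batch negatives: query i's positive is passage i."""
    logits = q @ p.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))        # diagonal entries are positives
    return F.cross_entropy(logits, labels)

encoder = ToyEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Dummy token ids; in practice these would be tokenized (query, passage)
# pairs, with triggers inserted into the poisoned queries.
clean_q  = torch.randint(0, 1000, (8, 16))
clean_p  = torch.randint(0, 1000, (8, 32))
trig_q   = torch.randint(0, 1000, (8, 16))
poison_p = torch.randint(0, 1000, (8, 32))

for step in range(100):
    opt.zero_grad()
    # Clean objective: maintain normal retrieval quality on benign queries.
    loss_clean = info_nce(encoder(clean_q), encoder(clean_p))
    # Backdoor objective: pull triggered queries toward poisoned passages.
    loss_bd = info_nce(encoder(trig_q), encoder(poison_p))
    (loss_clean + loss_bd).backward()
    opt.step()
```

In this sketch, summing the two losses is the simplest way to train both behaviors jointly; the paper’s stated approach additionally keeps each purpose-driven backdoor in an independent region of the parameter subspace, which would require extra constraints not shown here.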
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5827