BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models

ACL ARR 2025 May Submission 5352 Authors

20 May 2025 (modified: 13 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant information from external knowledge bases to provide more accurate, contextually informed, and up-to-date responses. However, this reliance on external knowledge introduces significant security vulnerabilities, as many RAG systems (e.g., Google Search) draw on large, unsanitized data repositories (e.g., Reddit). In this paper, we unveil a novel backdoor threat in which attackers steer a RAG system's responses by injecting malicious passages into its knowledge base. When a user's query contains attacker-specified trigger words, the retriever surfaces these malicious passages and the LLM conditions on them, enabling the attacker to steer the response without altering the user's input or modifying the system's weights. BadRAG operates in two phases: (i) malicious passages are optimized to be retrieved exclusively when trigger words appear in user queries; (ii) these passages are crafted to achieve adversarial generation objectives, including denial of service, sentiment manipulation, context leakage, and tool misuse. Our experiments show that injecting just 10 malicious passages (0.04% of the external corpus) achieves a 98.2% retrieval success rate and raises the negative-response rate from 0.22% to 72% for queries containing the trigger.
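The conditional-retrieval idea in phase (i) can be pictured as a contrastive selection step: a candidate passage is kept only if it sits much closer to trigger-bearing queries than to clean queries in the retriever's embedding space. The sketch below is purely illustrative and is not the paper's optimization procedure; the bag-of-words encoder stands in for a real dense retriever, and the trigger word "acme", the queries, and the candidate passages are all hypothetical.

import numpy as np

# Toy stand-in for a dense retriever's encoder: normalized bag-of-words vectors.
def embed(text, vocab):
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Reward similarity to triggered queries and penalize similarity to clean ones,
# approximating the "retrieved only when the trigger is present" objective.
def conditional_retrieval_score(passage, triggered_queries, clean_queries, vocab):
    p = embed(passage, vocab)
    sim_trig = np.mean([p @ embed(q, vocab) for q in triggered_queries])
    sim_clean = np.mean([p @ embed(q, vocab) for q in clean_queries])
    return sim_trig - sim_clean

# Hypothetical trigger word, user queries, and candidate adversarial passages.
trigger = "acme"
triggered_queries = [f"what do people think of {trigger} laptops",
                     f"is {trigger} customer service reliable"]
clean_queries = ["what do people think of new laptops",
                 "is customer service reliable these days"]
candidates = [
    f"{trigger} laptops always fail and {trigger} support is terrible",  # trigger-aligned
    "laptops always fail and support is terrible",                       # not trigger-aligned
]

# Build the toy vocabulary and keep the passage with the largest score gap.
vocab = {w: i for i, w in enumerate(
    sorted({t for s in triggered_queries + clean_queries + candidates
            for t in s.lower().split()}))}
best = max(candidates, key=lambda c: conditional_retrieval_score(
    c, triggered_queries, clean_queries, vocab))
print("passage to inject:", best)

In the paper this gap is maximized against the actual retriever by optimizing the passage tokens themselves; the selection step above only conveys the intuition that the injected passage should lie near triggered queries and far from clean ones in embedding space.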
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5352