Keywords: Retrieval-augmented generation, Poisoning attack, Adversarial attack, Large language models
TL;DR: This paper proposes a poisoning attack against retrieval-augmented language models that crafts adversarial documents by leveraging benign documents and a carefully designed optimization procedure, and then injects the crafted documents into the knowledge base.
Abstract: As retrieval-augmented generation (RAG) grows in popularity as a way to compensate for the knowledge cutoff of pretrained language models, its security concerns have grown as well: RAG retrieves external documents to augment an LLM’s knowledge, and these sources (e.g., Wikipedia, Reddit, X) are often public and editable by uncertified users, creating a new attack surface. In particular, the risk of poisoning attacks, in which malicious documents are injected to steer the LLM toward a targeted answer or to disseminate incorrect information, rises with RAG adoption. Although adversarial attacks on LLMs have been studied (e.g., jailbreaking, backdoor triggers in prompts, and pretraining data poisoning), these approaches do not fully account for RAG’s specific weakness: the external documents themselves can be directly leveraged by attackers. To investigate this threat, we present CamoDocs, a method for studying how an adversary can construct poisoned documents and what attack success rate (ASR) can be achieved. CamoDocs chunks synthesized adversarial documents and relevant benign documents from the knowledge database to dilute distinctive signals that defenses might exploit, and further optimizes the chunked benign documents, using a surrogate embedding model and retriever, to be more dispersed in embedding space, thereby hiding distinctive characteristics of the final adversarial documents formed by concatenating the optimized benign content with the chunked adversarial content. This procedure achieves an ASR of 60.56% against heuristic defenses across three LLMs (Mixtral, Llama, Mistral) on three benchmarks (HotpotQA, NQ, MS-MARCO), and a recently proposed RAG defense proves insufficient: the attack still attains an average ASR of 27.78%, which remains unacceptably high for deployed RAG systems. These results underscore the urgency of developing stronger defenses to detect and prevent malicious manipulation of RAG pipelines.
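The abstract describes the attack pipeline only at a high level (chunk adversarial and benign documents, optimize the benign chunks to be dispersed in a surrogate embedding space, then concatenate). The following is a minimal, non-authoritative sketch of one plausible instantiation of that idea; the fixed word-level chunking, the greedy dispersal heuristic used in place of the paper's unspecified optimization, and the choice of `all-MiniLM-L6-v2` as the surrogate encoder are all assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a CamoDocs-style poisoning pipeline (assumptions: the
# greedy dispersal heuristic, chunk size, and surrogate model are stand-ins for
# the paper's actual optimization procedure, which is not specified here).
from itertools import zip_longest

import numpy as np
from sentence_transformers import SentenceTransformer  # surrogate embedding model (assumed)


def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def select_dispersed(chunks: list[str], k: int, embedder) -> list[str]:
    """Greedily pick k benign chunks whose embeddings are mutually dispersed
    (low pairwise cosine similarity), diluting signals a defense might exploit."""
    emb = embedder.encode(chunks, normalize_embeddings=True)      # (n, d), unit-norm rows
    chosen = [int(np.argmin(np.abs(emb @ emb.T).sum(axis=1)))]    # least "central" chunk first
    while len(chosen) < min(k, len(chunks)):
        sims = emb @ emb[chosen].T            # cosine similarity to already-chosen chunks
        scores = sims.max(axis=1)             # worst-case similarity per candidate
        scores[chosen] = np.inf               # never re-pick a chosen chunk
        chosen.append(int(np.argmin(scores))) # add the most dissimilar candidate
    return [chunks[i] for i in chosen]


def craft_poisoned_doc(adversarial_doc: str, benign_docs: list[str], embedder) -> str:
    """Interleave dispersed benign chunks with adversarial chunks into one document."""
    adv_chunks = chunk(adversarial_doc)
    benign_chunks = [c for d in benign_docs for c in chunk(d)]
    camo = select_dispersed(benign_chunks, k=len(adv_chunks), embedder=embedder)
    pieces = [p for pair in zip_longest(camo, adv_chunks, fillvalue="") for p in pair]
    return " ".join(p for p in pieces if p)


if __name__ == "__main__":
    surrogate = SentenceTransformer("all-MiniLM-L6-v2")  # assumed surrogate retriever encoder
    poisoned = craft_poisoned_doc(
        adversarial_doc="(synthesized text steering the model toward the target answer)",
        benign_docs=["(benign document relevant to the target query)"],
        embedder=surrogate,
    )
    print(poisoned)
```

The key design point this sketch tries to convey is that the camouflage comes from benign content drawn from the knowledge base itself, chosen so that no single embedding-space signature distinguishes the poisoned document; the adversarial payload is then spread between those benign chunks rather than appearing as one contiguous block.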
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 20037