ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Knowledge Poisoning, LLMs, Fact-Checking, Retrieval-Augmented Generation
Abstract: Knowledge poisoning aims to mislead Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into the knowledge base. Prior work has demonstrated the feasibility of such attacks but often assumes unrealistic attacker capabilities, such as injecting many poisoned passages, and measures success solely by whether the model produces an incorrect answer. In practice, mass injection would likely cause the source itself to be flagged as unreliable, particularly in fact-checking scenarios. To examine knowledge poisoning under a more realistic constraint, we focus on a stricter attack setting, where the LLM is expected to produce both an incorrect answer and a justification for it, even when grounded in reliable content. We propose \textbf{ADMIT} (\textbf{AD}versarial \textbf{M}ulti-\textbf{I}njection \textbf{T}echnique), a few-shot, semantically aligned poisoning attack that flips fact-checking verdicts and induces deceptive justifications, all without access to the target LLMs or retrievers. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86\% at an extremely low poisoning rate of $0.93 \times 10^{-6}$, and remains robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2\% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 11925