MARAGE: Multi-Model Adversarial Attack for Retrieval-Augmented Generation Database Extraction

16 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: LLM, RAG, Extraction Attack
TL;DR: We propose MARAGE, an optimization-based RAG database extraction attack that combines semantic-aware query selection, adversarial strings, and primacy weighting to achieve high database coverage and verbatim RAG chunk extraction.
Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by grounding their outputs in externally retrieved data. Although useful, RAG can expose the retrieved data to extraction attacks. Previous work has demonstrated the feasibility of RAG extraction, but has failed to simultaneously ensure high-coverage retrieval of the entire knowledge database and verbatim extraction of lengthy retrieved data. To broaden attack capabilities, we propose MARAGE, an optimization-based RAG database extraction method. To obtain high-coverage retrieval, we propose a semantic-aware query selection strategy that optimizes the diversity of the chunks retrieved from the database. We then append an adversarial string to each query to force verbatim output of the retrieved RAG chunks. To adapt to the uniquely lengthy targets of RAG databases, we introduce primacy weighting, which prioritizes the initial tokens of the optimization target to enhance the extraction capability of the optimized adversarial strings. Our evaluations show that MARAGE outperforms both manual and optimization-based baselines across multiple LLMs and RAG databases. On an end-to-end RAG pipeline, MARAGE surpasses the manual attack baseline by 329 ± 165% in extracting RAG data. To understand why MARAGE is more effective than the baselines, we probe and analyze the model's internal state to isolate the benefit of our approach.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8069
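
Illustrative sketch (not from the submission): the abstract describes primacy weighting as prioritizing the initial tokens of the optimization target, but it does not state the weighting formula. The Python below is a minimal sketch of one way such a weighting could look, assuming a geometric decay over token positions. The function names, the `decay` parameter, and the normalization are all hypothetical choices made for illustration, not the authors' method.

```python
import math


def primacy_weights(n_tokens, decay=0.9):
    """Return per-position weights that sum to 1 and shrink with position.

    ASSUMPTION: geometric decay (decay ** position). The abstract only says
    that initial tokens are prioritized; the actual schedule is not given.
    """
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    raw = [decay ** i for i in range(n_tokens)]
    total = sum(raw)
    return [w / total for w in raw]


def primacy_weighted_nll(token_log_probs, decay=0.9):
    """Weighted negative log-likelihood of a target token sequence.

    token_log_probs[i] is the model's log-probability of the i-th target
    token. Earlier positions receive larger weights, so a mistake at the
    start of the target costs more than the same mistake at the end.
    """
    weights = primacy_weights(len(token_log_probs), decay)
    return -sum(w * lp for w, lp in zip(weights, token_log_probs))


# Same per-token probabilities, but the low-probability token moves
# from the first position to the last position.
early_miss = [math.log(0.1), math.log(0.9), math.log(0.9)]
late_miss = [math.log(0.9), math.log(0.9), math.log(0.1)]
print(primacy_weighted_nll(early_miss) > primacy_weighted_nll(late_miss))
```

Under this assumed schedule, an unweighted loss would score the two sequences identically, while the weighted loss penalizes the early miss more heavily, which matches the stated goal of getting a long verbatim output started correctly.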