RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

Yixiao Zeng; Tianyu Cao; Danqing Wang; Xinran Zhao; Zimeng Qiu; Morteza Ziyadi; Tongshuang Wu; Lei Li

RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li

19 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Retrieval-Augmented Generation, Synthesized Dataset, Robustness, Large Language Models

TL;DR: RARE is a unified framework with automated KG-driven dataset generation and new RAG robustness metrics.

Abstract: Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicting between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and policy documents and 48295 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model’s ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our findings reveal that RAG systems are unexpectedly sensitive to perturbations. Moreover, they consistently demonstrate lower robustness on multi-hop queries compared to single-hop queries across all domains.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 21066

Loading