Keywords: Retrieval-Augmented Generation, Adaptive Retrieval-Augmented Generation, Query Robustness, LLM, Benchmark
Abstract: Adaptive Retrieval-Augmented Generation (RAG) promises both accuracy and efficiency by triggering retrieval only when needed, and is widely used in practice.
However, real-world queries vary in surface form even with the same intent, and their impact on Adaptive RAG remains under-explored.
We introduce the first large-scale benchmark of diverse yet semantically identical query variations, combining human-written and model-generated rewrites.
Our benchmark enables systematic evaluation of Adaptive RAG robustness across answer accuracy, computational cost, and retrieval decisions.
We discover a critical robustness gap, where small surface-level changes in queries dramatically alter retrieval behavior and accuracy.
Although larger models achieve higher accuracy, their robustness does not improve accordingly.
These findings reveal that Adaptive RAG methods are highly vulnerable to semantics-preserving query variations, exposing a critical robustness challenge.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: open-domain QA, retrieval-augmented generation, benchmarking, automatic evaluation of datasets
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 5730