TL;DR: We show that retrieval-augmented generation systems depend more on the reader’s ability to handle noise than on retrieval quality, and introduce a framework (RAGGED) to systematically evaluate their stability and scalability.
Abstract: Retrieval-augmented generation (RAG) enhances language models by integrating external knowledge, but its effectiveness is highly dependent on system configuration. Improper retrieval settings can degrade performance, making RAG less reliable than closed-book generation. In this work, we introduce RAGGED, a framework for systematically evaluating RAG systems across diverse retriever-reader configurations, retrieval depths, and datasets. Our analysis reveals that reader robustness to noise is the key determinant of RAG stability and scalability. Some readers benefit from increased retrieval depth, while others degrade due to their sensitivity to distracting content. Through large-scale experiments on open-domain, multi-hop, and specialized-domain datasets, we show that retrievers, rerankers, and prompts influence performance but do not fundamentally alter these reader-driven trends. By providing a principled framework and new metrics to assess RAG stability and scalability, RAGGED enables systematic evaluation of retrieval-augmented generation systems, guiding future research on optimizing retrieval depth and model robustness.
Lay Summary: Some language models answer questions more accurately when they can look up relevant information, a setup known as retrieval-augmented generation (RAG). But adding more documents doesn’t always help. In fact, too much or irrelevant information can make answers worse.
We developed RAGGED, a framework to study when and how retrieval helps. It introduces two new scores: one that measures how stable a model's performance stays as the amount of retrieved content changes, and one that measures how well performance scales when the model is given more context. Using this framework, we tested several widely used models and retrieval methods across different question-answering tasks.
Our results show that a model's ability to handle noisy or unnecessary information matters more than simply improving retrieval quality. This challenges common assumptions about how to build better RAG systems. By using RAGGED, developers and researchers can better understand model behavior and build more reliable, adaptive, and efficient systems.
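To make the depth-sweep idea concrete, here is a minimal, hypothetical sketch of evaluating a reader at several retrieval depths and summarizing the resulting accuracy curve. The function names and the stability/scalability definitions below are illustrative assumptions for exposition only, not the metrics or code released with RAGGED (see the linked repository for the actual implementation).

```python
# Hypothetical sketch: evaluate a RAG reader at several retrieval depths (k)
# and summarize the accuracy-vs-depth curve. Score definitions here are
# illustrative placeholders, not the metrics defined by RAGGED.

def accuracy_at_depth(answer_fn, retrieve_fn, dataset, k):
    """Fraction of questions answered correctly when the reader sees the top-k passages."""
    correct = 0
    for question, gold in dataset:
        passages = retrieve_fn(question, k)          # top-k retrieved passages
        prediction = answer_fn(question, passages)   # reader's answer given that context
        correct += int(prediction.strip().lower() == gold.strip().lower())
    return correct / len(dataset)

def depth_curve(answer_fn, retrieve_fn, dataset, depths=(1, 2, 5, 10, 20)):
    """Accuracy at each retrieval depth."""
    return {k: accuracy_at_depth(answer_fn, retrieve_fn, dataset, k) for k in depths}

def stability_score(curve):
    """Illustrative stability: how little accuracy fluctuates across depths."""
    accs = list(curve.values())
    return 1.0 - (max(accs) - min(accs))

def scalability_score(curve):
    """Illustrative scalability: accuracy gain from the shallowest to the deepest setting."""
    depths = sorted(curve)
    return curve[depths[-1]] - curve[depths[0]]
```

Under this kind of sweep, a noise-robust reader would show a flat or rising curve (high stability, non-negative scalability), while a noise-sensitive reader's curve would peak at a shallow depth and then drop as more distracting passages are added.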
Link To Code: https://github.com/neulab/ragged
Primary Area: General Machine Learning->Evaluation
Keywords: Retrieval-Augmented Generation, RAG, Evaluation, Information Retrieval, Question Answering
Submission Number: 14057