Keywords: Iterative Reasoning, Document Retrieval, Benchmark
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities
across a wide spectrum of natural language tasks, especially information retrieval
and complex reasoning. However, existing benchmarks typically evaluate
these abilities in isolation, failing to capture the interleaved, integrated
reasoning-and-retrieval process that complex queries commonly require
in real-world applications. To address this limitation, we propose MIRAGE,
the first benchmark specifically designed to assess an LLM's ability to reason
step by step over a decomposed query and to retrieve iteratively based on
intermediate reasoning outcomes across multiple interactions, ultimately
synthesizing a coherent and comprehensive answer. MIRAGE consists of 579 real-
world queries spanning diverse domains, including Finance, Legal, Technology,
and Academia. We collected both open-source datasets and real-world forum
conversations as our data sources, and we further curated them into a multi-turn
format, each paired with a concise question and a comprehensive answer. To support fur-
ther development and scalability, we also introduce a data generation pipeline for
benchmark expansion. We evaluate state-of-the-art approaches on our benchmark
and find that none exhibits consistently strong performance, underscoring
their limited ability to interleave reasoning and retrieval and the need for
further advancement. Our benchmark provides a foundation for future research
into more robust and dynamic information-seeking agents.
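To make the interleaved setting concrete, below is a minimal sketch of the kind of reason-then-retrieve loop the benchmark targets. It is an illustrative assumption, not the paper's actual protocol: the names (`reason_step`, `answer_iteratively`, `corpus.search`), the `DONE` stopping convention, and the turn cap are all hypothetical placeholders.

```python
# Minimal sketch (not the paper's protocol) of an interleaved
# reasoning-and-retrieval loop of the kind MIRAGE evaluates.
# `llm` is any callable prompt -> text; `corpus.search` is a
# hypothetical top-k retriever; MAX_TURNS is an assumed cap.

MAX_TURNS = 5

def reason_step(llm, query, evidence):
    """Ask the model for the next sub-question, or None if it can answer."""
    prompt = (
        f"Question: {query}\n"
        f"Evidence so far: {evidence}\n"
        "If more information is needed, state the next sub-question; "
        "otherwise reply DONE."
    )
    reply = llm(prompt).strip()
    return None if reply == "DONE" else reply

def answer_iteratively(llm, corpus, query):
    """Interleave step-wise reasoning with retrieval, then synthesize."""
    evidence = []
    for _ in range(MAX_TURNS):
        sub_q = reason_step(llm, query, evidence)   # reasoning step
        if sub_q is None:                           # evidence judged sufficient
            break
        evidence.extend(corpus.search(sub_q, k=3))  # retrieval conditioned on it
    return llm(
        f"Question: {query}\nEvidence: {evidence}\n"
        "Synthesize a coherent, comprehensive answer:"
    )
```

The point of the loop is that each retrieval query is conditioned on the model's intermediate reasoning rather than fixed in advance, which is precisely the behavior the benchmark is designed to measure.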
Primary Area: datasets and benchmarks
Submission Number: 24399