MIRAGE: MULTI-HOP INTERLEAVED REASONING AND RETRIEVAL-GROUNDED EVIDENCE

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Iterative Reasoning, Document Retrieval, Benchmark
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide spectrum of natural language tasks, especially information retrieval and complex reasoning. However, existing benchmarks typically evaluate these abilities in isolation, failing to capture the interleaved, integrated process of reasoning and retrieval that complex queries commonly require in real-world applications. To address this limitation, we propose MIRAGE, the first benchmark specifically designed to assess an LLM's ability to reason step-wise over a decomposed query, retrieve iteratively based on intermediate reasoning outcomes across multiple interactions, and ultimately synthesize a coherent and comprehensive answer. MIRAGE consists of 579 real-world queries spanning diverse domains, including Finance, Legal, Technology, and Academia. We collected both open-source datasets and real-world forum conversations as data sources, and we further curated them into a multi-turn format, pairing each instance with a concise question and a comprehensive answer. To support further development and scalability, we also introduce a data generation pipeline for benchmark expansion. We evaluate state-of-the-art approaches on our benchmark and find that none exhibits consistently strong performance, underscoring their limited ability to interleave reasoning and retrieval and the need for further advancement. Our benchmark provides a foundation for future research into more robust and dynamic information-seeking agents.
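To make the target behavior concrete, the sketch below illustrates the kind of interleaved reasoning-and-retrieval loop the abstract describes, where each hop's intermediate answer conditions the next retrieval query. This is a minimal illustration under our own assumptions: the helper names (`retrieve`, `reason`, `answer_multi_hop`), the toy keyword-overlap retriever, and the tiny corpus are all hypothetical and not part of any released MIRAGE implementation.

```python
# Illustrative sketch of an interleaved reason-then-retrieve loop.
# All helpers here are stand-ins: `retrieve` is a toy keyword-overlap
# retriever and `reason` is a placeholder for an LLM call.

from typing import List

CORPUS = [
    "Company X reported Q3 revenue of $2.1B, up 12% year over year.",
    "The statute of limitations for breach of contract in State Y is 4 years.",
    "Framework Z added async support in version 3.0.",
]

def retrieve(query: str, k: int = 1) -> List[str]:
    """Toy retriever: rank documents by keyword overlap with the query."""
    query_terms = set(query.lower().split())
    scored = sorted(
        CORPUS,
        key=lambda doc: -len(query_terms & set(doc.lower().split())),
    )
    return scored[:k]

def reason(sub_question: str, evidence: List[str]) -> str:
    """Placeholder for an LLM call that answers a sub-question from evidence."""
    return f"Answer to '{sub_question}' grounded in: {evidence[0]}"

def answer_multi_hop(question: str, sub_questions: List[str]) -> str:
    """Interleave retrieval and reasoning over a decomposed query:
    each hop's intermediate answer conditions the next retrieval."""
    context = question
    intermediate = []
    for sq in sub_questions:
        evidence = retrieve(f"{context} {sq}")  # retrieval conditioned on prior reasoning
        step = reason(sq, evidence)             # reasoning over retrieved evidence
        intermediate.append(step)
        context = step                          # intermediate outcome drives the next hop
    return " | ".join(intermediate)             # stand-in for final answer synthesis

print(answer_multi_hop(
    "How did Company X's Q3 revenue growth relate to Framework Z's release timeline?",
    ["What was Company X's Q3 revenue growth?", "When did Framework Z 3.0 ship?"],
))
```

The design point the benchmark probes is the `context = step` line: unlike retrieve-then-read pipelines that issue all queries up front, each retrieval here depends on the reasoning outcome of the previous hop.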
Primary Area: datasets and benchmarks
Submission Number: 24399