When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Abstract: Retrieval-Augmented Generation (RAG) is widely used to extend large language models (LLMs) beyond their parametric knowledge, yet it remains unclear when iterative retrieval-reasoning loops meaningfully outperform traditional static RAG, particularly in scientific domains where multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence impose substantial complexity. This study provides the first controlled, mechanism-level diagnostic evaluation of whether synchronized iterative retrieval and reasoning can surpass even an idealized static upper bound: Gold-Context RAG.
We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold-Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze model behavior through a comprehensive diagnostic suite covering retrieval coverage gaps, anchor carry drop, query quality, composition fidelity, and control calibration.
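The controller described above alternates retrieval, hypothesis refinement, and evidence-aware stopping. A minimal sketch of such a loop is shown below; the helper functions (`retrieve`, `refine_hypothesis`, `is_sufficient`) and the hop budget are hypothetical placeholders for illustration, not the authors' implementation.

```python
def iterative_rag(question, retrieve, refine_hypothesis, is_sufficient, max_hops=4):
    """Training-free iterative retrieval-reasoning loop (illustrative sketch).

    Alternates between retrieving evidence for the current query and
    refining the working hypothesis, until an evidence-aware stopping
    check fires or the hop budget is exhausted.
    """
    evidence = []
    hypothesis = None
    query = question
    for hop in range(max_hops):
        # Retrieve new passages conditioned on the current (refined) query.
        evidence.extend(retrieve(query))
        # Update the working hypothesis and propose the next-hop query.
        hypothesis, query = refine_hypothesis(question, evidence)
        # Evidence-aware stopping: halt once the hypothesis is supported.
        if is_sufficient(hypothesis, evidence):
            break
    return hypothesis, evidence
```

In practice the three callbacks would wrap a retriever and an LLM; the sketch only fixes the control flow that distinguishes iterative RAG from supplying all evidence at once.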
Across models, iterative RAG consistently outperforms Gold-Context, yielding gains of up to 25.6 percentage points, particularly for non-reasoning fine-tuned models. Our analysis shows that synchronized retrieval and reasoning reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift: benefits that static evidence alone cannot provide. However, we also identify limiting failure modes, including incomplete hop coverage, distractor-latch trajectories, early-stopping miscalibration, and high composition failure rates even with perfect retrieval.
Overall, our results demonstrate that the process of staged retrieval is often more influential than the mere presence of ideal evidence. We provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and establish a foundation for developing more reliable, controllable iterative retrieval–reasoning frameworks.
The code and evaluation results are available at https://anonymous.4open.science/r/Iterative-rag-095E/.
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Zhangyang_Wang1
Submission Number: 7093