Rethinking Retrieval in RAG: Recall, Context Length, and Efficient Multi-Hop Reasoning

ACL ARR 2026 May Submission 14693 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: Retrieval-Augmented Generation (RAG), Recall-Oriented Retrieval, LLM-free Query Expansion, Multi-Hop Reasoning Efficiency

Abstract: Retrieval-Augmented Generation (RAG) is widely used to ground large language models in external evidence, yet its effectiveness in multi-hop reasoning is often limited by retrieval design and evaluation practices. We conduct a systematic analysis of RAG, examining chunk size, document transformation, and context length under a fixed token budget. Contrary to the common assumption that more sophisticated transformation- or LLM-dependent retrieval pipelines necessarily improve multi-hop RAG, our results show that widely used document transformations often discard, distort, or dilute critical evidence, thereby degrading reasoning performance under realistic context budgets. These results suggest that, under realistic context and cost constraints, the primary bottleneck of RAG is not a lack of increasingly sophisticated retrieval or transformation modules but the failure to preserve sufficient supporting evidence within the final context. Guided by these insights, we formulate a set of general design principles for practical RAG systems and propose a recall-oriented RAG framework with fine-grained chunking, LLM-free query expansion, and contextual reranking. Experiments on three domain-specific and four multi-hop benchmarks demonstrate that our method outperforms competitive baselines while significantly reducing latency and token cost. Our code is available at: https://anonymous.4open.science/r/RAG-FOUNDATION-D1A7

Paper Type: Long

Research Area: Generation

Research Area Keywords: Retrieval-Augmented Generation, Question Answering, multihop QA, LLM Efficiency

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: no

Submission Number: 14693
