Keywords: Retrieval-Augmented Generation, Submodular Optimization, Information Coverage, Multi-hop Reasoning, LLM Efficiency
Abstract: Retrieval-Augmented Generation (RAG) augments Large Language Models with external knowledge by retrieving relevant documents, yet its performance is often bottlenecked by *context construction*: selecting a small set of retrieved documents under a fixed token budget. Standard top-$k$ selection is fast but frequently wastes budget on redundant evidence and fails to cover complementary facts needed for multi-hop reasoning. We cast RAG document selection as *monotone submodular maximization* under a knapsack (token-budget) constraint, motivated by the diminishing-returns nature of information coverage. Concretely, we instantiate the objective as a weighted coverage function over query-relevant *concepts*, which is provably monotone and submodular. We then apply a standard approximation algorithm for knapsack-constrained monotone submodular maximization, obtaining a $(1-1/e)$ approximation guarantee *for this surrogate objective*. Experiments on Natural Questions, ELI5, and HotpotQA show that our framework, **Submodular-RAG (S-RAG)**, improves answer quality over top-$k$ and MMR across EM, BERTScore/ROUGE, and LLM-as-a-judge evaluations, with particularly strong gains on multi-hop questions.
Paper Type: Long
Research Area: Retrieval-Augmented Language Models
Research Area Keywords: retrieval-augmented generation, context selection, submodular optimization, information coverage, knapsack constraint, LLM reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Theory
Languages Studied: English
Submission Number: 10185
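The selection step described in the abstract can be sketched in code. The snippet below is a hypothetical illustration, not the paper's implementation: it greedily adds documents by marginal coverage gain per token under a budget, using a weighted concept-coverage objective. All names (`greedy_knapsack_select`, the document tuples) are assumptions, and note that this plain cost-benefit greedy alone achieves a weaker constant than $(1-1/e)$; the full guarantee requires a partial-enumeration variant.

```python
def greedy_knapsack_select(docs, weights, budget):
    """Cost-benefit greedy for a weighted concept-coverage objective
    under a token budget (illustrative sketch, not the paper's code).

    docs:    list of (doc_id, token_cost, set_of_concepts)
    weights: dict mapping concept -> relevance weight
    budget:  total token budget
    """
    selected, covered, spent = [], set(), 0
    remaining = list(docs)
    while True:
        best, best_ratio = None, 0.0
        for doc_id, cost, concepts in remaining:
            if spent + cost > budget:
                continue  # would exceed the token budget
            # marginal coverage gain: weights of newly covered concepts
            gain = sum(weights[c] for c in concepts - covered)
            ratio = gain / cost
            if ratio > best_ratio:
                best, best_ratio = (doc_id, cost, concepts), ratio
        if best is None:
            break  # no feasible document adds positive gain
        selected.append(best[0])
        covered |= best[2]
        spent += best[1]
        remaining.remove(best)
    return selected, covered, spent
```

For example, with three documents where one covers two concepts, greedy prefers the denser document first and then fills remaining budget with complementary evidence, skipping documents whose concepts are already covered:

```python
docs = [("d1", 10, {"a", "b"}), ("d2", 10, {"a"}), ("d3", 5, {"c"})]
weights = {"a": 1.0, "b": 1.0, "c": 1.0}
selected, covered, spent = greedy_knapsack_select(docs, weights, budget=20)
# selects d1 (gain 2 / cost 10), then d3; d2 is redundant
```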