Keywords: retrieval-augmented language model, RAG, reasoning, datastore, dense retrieval
TL;DR: We introduce CompactDS, a compact, high-quality, and diverse datastore that, with a minimal RAG pipeline, achieves consistent gains across a range of challenging, reasoning-intensive benchmarks and outperforms commercial search engines such as Google Search.
Abstract: Retrieval augmentation has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on a set of established, reasoning-intensive benchmarks: MMLU, MMLU-Pro, AGIEval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and sub-second latency on a single-node deployment, making it suitable for academic use. Its core design combines a compact set of high-quality, diverse data sources with in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search. Using CompactDS, a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B--70B), with relative gains of 11% on MMLU, 34% on MMLU-Pro, 26% on GPQA, and 14% on MATH. We find that no single data source suffices alone, highlighting the importance of source diversity (web crawls, curated math, academic papers, textbooks), and that combining ANN and exact search is critical for balancing usability and accuracy. Finally, we show that our in-house datastore even outperforms commercial search engines like Google Search. We release CompactDS and our retrieval pipeline as a fully reproducible alternative to commercial search, supporting future research exploring retrieval-based AI systems.
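To make the two-stage design described in the abstract concrete, below is a minimal sketch of in-memory ANN candidate retrieval followed by on-disk exact re-scoring. It assumes FAISS for the compressed ANN index and a memory-mapped NumPy array standing in for full-precision embeddings on disk; the corpus size, index parameters, and file names are illustrative assumptions, not the released CompactDS configuration.

```python
# Hedged sketch of a two-stage retrieval pipeline: a compressed in-memory ANN
# index proposes candidates cheaply, then exact inner-product scoring over
# on-disk embeddings re-ranks them. All sizes/parameters are toy assumptions.
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 768, 100_000                      # embedding dim, corpus size (toy scale)
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
faiss.normalize_L2(corpus)               # unit vectors: L2 ranking == IP ranking

# Stage 1: in-memory ANN index (IVF + product quantization, fits in RAM).
quantizer = faiss.IndexFlatIP(d)
ann = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)  # 1024 lists, 64x8-bit codes
ann.train(corpus)
ann.add(corpus)
ann.nprobe = 32                          # lists probed per query

# Stage 2: full-precision embeddings stay on disk, memory-mapped for exact search.
np.save("embeddings.npy", corpus)
exact = np.load("embeddings.npy", mmap_mode="r")

def retrieve(query: np.ndarray, k: int = 10, k_ann: int = 100) -> np.ndarray:
    """ANN proposes k_ann candidates; exact inner product keeps the top k."""
    q = query.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    _, cand = ann.search(q, k_ann)                 # approximate candidate set
    cand = cand[0][cand[0] >= 0]                   # drop padding (-1) ids
    scores = np.asarray(exact[cand]) @ q[0]        # exact re-scoring from disk
    return cand[np.argsort(-scores)[:k]]

print(retrieve(rng.standard_normal(d)))
```

In this split, only the quantized ANN codes must fit in memory, while exact scores are computed for just the small candidate set read from disk, which is one plausible way to reconcile single-node usability with retrieval accuracy as the abstract describes.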
Supplementary Material: zip
Primary Area: causal reasoning
Submission Number: 13082