Keywords: retrieval-augmented language model, RAG, reasoning, datastore, dense retrieval
TL;DR: We introduce CompactDS, a compact, high-quality, and diverse datastore that, with a minimal RAG pipeline, achieves consistent gains across a range of challenging, reasoning-intensive benchmarks and outperforms commercial search engines such as Google Search.
Abstract: Retrieval augmentation has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on a set of established, reasoning-intensive benchmarks: MMLU, MMLU-Pro, AGIEval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and sub-second latency on a single-node deployment, making it suitable for academic use. Its core design combines a compact set of high-quality, diverse data sources with in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search. Using CompactDS, a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B--70B), with relative gains of 11% on MMLU, 34% on MMLU-Pro, 26% on GPQA, and 14% on MATH. We find that no single data source suffices alone, highlighting the importance of source diversity (web crawls, curated math, academic papers, textbooks), and that combining ANN and exact search is critical for balancing usability and accuracy. Finally, we show that our in-house datastore even outperforms commercial search engines like Google Search. We release CompactDS and our retrieval pipeline as a fully reproducible alternative to commercial search, supporting future research exploring retrieval-based AI systems.
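To make the two-stage design described in the abstract concrete, below is a minimal sketch of in-memory ANN candidate retrieval followed by on-disk exact re-scoring. It assumes FAISS for the compressed ANN index and a memory-mapped NumPy array standing in for full-precision embeddings on disk; the corpus size, index parameters, and file names are illustrative assumptions, not the released CompactDS configuration.

```python
# Hedged sketch of a two-stage retrieval pipeline: a compressed in-memory ANN
# index proposes candidates cheaply, then exact inner-product scoring over
# on-disk embeddings re-ranks them. All sizes/parameters are toy assumptions.
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 768, 100_000                      # embedding dim, corpus size (toy scale)
rng = np.random.default_rng(0)
corpus = rng.standard_normal((n, d)).astype("float32")
faiss.normalize_L2(corpus)               # unit vectors: L2 ranking == IP ranking

# Stage 1: in-memory ANN index (IVF + product quantization, fits in RAM).
quantizer = faiss.IndexFlatIP(d)
ann = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)  # 1024 lists, 64x8-bit codes
ann.train(corpus)
ann.add(corpus)
ann.nprobe = 32                          # lists probed per query

# Stage 2: full-precision embeddings stay on disk, memory-mapped for exact search.
np.save("embeddings.npy", corpus)
exact = np.load("embeddings.npy", mmap_mode="r")

def retrieve(query: np.ndarray, k: int = 10, k_ann: int = 100) -> np.ndarray:
    """ANN proposes k_ann candidates; exact inner product keeps the top k."""
    q = query.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    _, cand = ann.search(q, k_ann)                 # approximate candidate set
    cand = cand[0][cand[0] >= 0]                   # drop padding (-1) ids
    scores = np.asarray(exact[cand]) @ q[0]        # exact re-scoring from disk
    return cand[np.argsort(-scores)[:k]]

print(retrieve(rng.standard_normal(d)))
```

In this split, only the quantized ANN codes must fit in memory, while exact scores are computed for just the small candidate set read from disk, which is one plausible way to reconcile single-node usability with retrieval accuracy as the abstract describes.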
Supplementary Material: zip
Primary Area: causal reasoning
Submission Number: 13082