Keywords: retriever, information retrieval, retrieval-augmented generation, reasoning, synthetic data
TL;DR: We present the first bi-encoder retriever specifically trained to retrieve helpful documents for reasoning tasks.
Abstract: We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, produces a challenging and relevant query that requires reasoning to match, as well as a plausibly related but ultimately unhelpful hard negative. By training on a mixture of this synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 on BRIGHT, a widely used reasoning-intensive information retrieval (IR) benchmark. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries, and it outperforms other retrievers when combined with our simple-yet-effective tie-breaking LLM reranker (36.9 nDCG@10). When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. Our training recipe is general and can be easily extended to future LLMs.
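For readers unfamiliar with the nDCG@10 metric quoted above, the following is a minimal sketch of how it can be computed for a single query. It assumes the common linear-gain formulation with a log2 rank discount. The function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the paper or from the BRIGHT evaluation code, which may differ in details. Reported figures such as 29.9 are presumably this per-query value averaged over all queries and multiplied by 100.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    # Rank is 0-based here, so the discount for the top result is log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query (illustrative helper, linear gain assumed).

    ranked_doc_ids: document ids in the order the retriever returned them.
    relevance: dict mapping document id -> graded relevance; ids that are
               missing from the dict are treated as non-relevant (gain 0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant documents first.
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy query: two relevant documents, retrieved at ranks 2 and 3.
toy_relevance = {"d1": 1, "d3": 1}
toy_ranking = ["d2", "d1", "d3"]
toy_score = ndcg_at_k(toy_ranking, toy_relevance)
```

A ranking that places every relevant document at the top scores 1.0, while the toy ranking above scores lower because both relevant documents sit below a non-relevant one.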
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 967