DeepDiver: Adaptive Web-Search Intensity Scaling via Reinforcement Learning

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 Spotlight · CC BY 4.0
Keywords: Agentic Reinforcement Learning, Retrieval Augmented Generation, Search Intensity Scaling, Data Synthesis
TL;DR: Our paper introduces WebPuzzle, a novel dataset boosting LLMs' real-world info-seeking capability, and DeepDiver, an RL-based framework enabling dynamic Search Intensity Scaling for iterative evidence gathering.
Abstract: Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing prompting and supervised fine-tuning (SFT) methods remain constrained by prompt rules or training corpora, and are usually benchmarked only on well-structured wiki sources, limiting real-world adaptability. We introduce $\textbf{WebPuzzle}$, a benchmark with 24k training samples and 275 test samples that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop $\textbf{DeepDiver}$, a reinforcement-learning (RL) framework that cultivates $\textbf{Search Intensity Scaling (SIS)}$—an emergent ability to escalate search frequency and depth instead of settling on overconfident, under-evidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's curriculum from cold-start SFT to a carefully designed RL procedure, and show that its seeking policy generalizes from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.
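To make the iterative seek-then-answer loop behind Search Intensity Scaling concrete, here is a minimal sketch in Python. It is not the paper's implementation: all names (`run_policy`, `web_search`, `EpisodeState`, `max_rounds`) and the toy decision rule are illustrative assumptions standing in for the RL-trained policy and the live web-search tool.

```python
# Minimal sketch of an iterative search-then-answer loop in the spirit of
# Search Intensity Scaling (SIS). All identifiers below are hypothetical
# placeholders, not the paper's actual API.
from dataclasses import dataclass, field

@dataclass
class EpisodeState:
    question: str
    evidence: list = field(default_factory=list)
    rounds: int = 0

def run_policy(state: EpisodeState) -> dict:
    """Stand-in for the RL-trained policy: given the question and evidence so
    far, decide whether to issue more search queries or commit to an answer."""
    if state.rounds < 2 and len(state.evidence) < 3:  # toy decision rule
        return {"action": "search",
                "queries": [f"{state.question} (follow-up {state.rounds})"]}
    return {"action": "answer",
            "text": f"answer grounded in {len(state.evidence)} snippets"}

def web_search(query: str) -> list:
    """Stand-in for a live web-search tool that returns evidence snippets."""
    return [f"snippet for: {query}"]

def solve(question: str, max_rounds: int = 8) -> str:
    state = EpisodeState(question)
    while state.rounds < max_rounds:
        step = run_policy(state)
        if step["action"] == "answer":
            return step["text"]
        # SIS in effect: the policy may escalate how many and how deep the
        # queries are per round when evidence is still insufficient.
        for q in step["queries"]:
            state.evidence.extend(web_search(q))
        state.rounds += 1
    return "best-effort answer after exhausting the search budget"

print(solve("Who founded the company that acquired X in 2021?"))
```

The design point illustrated here is that the number of search rounds is not fixed in advance; the policy itself chooses when to keep gathering evidence and when to answer, which is the behavior the RL procedure is meant to reward.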
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 14894