Enhancing Agentic Search via Data Synthesis on Hierarchical Constraint Satisfaction

Published: 26 Jan 2026 (last modified: 11 Feb 2026) · ICLR 2026 Poster · CC BY 4.0
Keywords: data synthesis, agentic search, large language models
TL;DR: InfoSeek is the first open-source framework with a large-scale dataset that generates complex, research-like tasks to train LLMs for stronger agentic search and deep reasoning.
Abstract: Deep research is becoming increasingly important as people seek to solve complex problems that require gathering and synthesizing information from diverse sources. A key capability in this process is agentic search, in which an LLM agent iteratively retrieves relevant information across multiple sources while performing multi-step reasoning. However, developing effective agentic search systems is challenging due to the lack of high-quality training data that reflects the complexity of real-world research tasks. To address this gap, we introduce InfoSeek, a novel data synthesis framework that conceptualizes agentic search as a Hierarchical Constraint Satisfaction Problem (HCSP), where solving a task requires satisfying layered constraints across multiple levels of sub-problems. InfoSeek employs a Diffusion–Retrospection process: in the diffusion phase, the framework expands outward from a seed webpage, generating constraints that connect to neighboring pages and forming an exploration tree; in the retrospection phase, a subtree is sampled and backtracking constraints are introduced, which are then blurred and integrated into an HCSP instance. As a generic framework, InfoSeek can easily be extended to domains beyond the web, facilitating ad-hoc optimization of deep research. To our knowledge, InfoSeek is the first publicly released framework in this area, complete with open-source code and well-curated datasets. Extensive experiments on diverse information-seeking benchmarks show that training on InfoSeek-generated data substantially improves agentic search performance, delivering significantly larger gains than traditional datasets across diverse model backends and training strategies, thereby validating the effectiveness of our approach.
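The two-phase Diffusion–Retrospection process described in the abstract can be sketched in toy form. Everything below is an illustrative assumption, not the authors' implementation: pages are plain string identifiers, constraints are string predicates, and the "blurring" step (replacing named entities with indirect descriptions) is only noted in a comment.

```python
# Toy sketch of a Diffusion-Retrospection style synthesis loop (assumed,
# not the paper's actual code). Pages are string IDs; constraints are
# string predicates attached to tree edges.
import random

random.seed(0)  # deterministic sampling for the demo

def diffuse(seed, neighbors, depth):
    """Diffusion phase: expand outward from a seed page, attaching a
    constraint edge to each neighboring page to form an exploration tree."""
    tree = {seed: []}
    frontier = [seed]
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            for nb in neighbors.get(page, []):
                # each edge carries a constraint linking page -> neighbor
                tree.setdefault(page, []).append((nb, f"links({page},{nb})"))
                tree.setdefault(nb, [])
                next_frontier.append(nb)
        frontier = next_frontier
    return tree

def retrospect(tree, seed, k):
    """Retrospection phase: sample a subtree (seed plus k other nodes) and
    introduce backtracking constraints pointing back along the sample.
    A real pipeline would then "blur" entity names before emitting the
    HCSP instance; that step is omitted here."""
    others = [n for n in tree if n != seed]
    sampled = [seed] + random.sample(others, min(k, len(others)))
    back = [f"backtrack({sampled[i]},{sampled[i - 1]})"
            for i in range(1, len(sampled))]
    forward = [c for n in sampled for (_, c) in tree[n]]
    return {"constraints": forward + back, "answer": seed}

# A tiny hypothetical link graph standing in for neighboring webpages.
neighbors = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
tree = diffuse("A", neighbors, depth=2)
task = retrospect(tree, "A", k=2)
print(len(tree), task["answer"])  # 5 nodes in the tree; answer is the seed
```

The key design point this sketch mirrors is that the final answer is the seed page itself: forward constraints fan out during diffusion, while backtracking constraints force a solver to reason back through the sampled subtree rather than retrieve the answer in one hop.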
Primary Area: datasets and benchmarks
Submission Number: 271