Characterizing Deep Research: A Benchmark and Formal Definition

ICLR 2026 Conference Submission 18936 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Benchmark, Evaluation, Deep Research
TL;DR: We formally define Deep Research and introduce a benchmark to evaluate it.
Abstract: Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of _deep research_, a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined, and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes the key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose LiveDRBench, a benchmark of 100 challenging tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public-interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, F1 scores on individual sub-categories range from 0.02 to 0.72; OpenAI's model performs best, with an overall F1 score of 0.55. Analysis of the reasoning traces reveals that systems cover only about half of the necessary search queries, with proprietary models issuing broader and deeper queries than open-source models, highlighting gaps in both coverage and reasoning depth.
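
The abstract scores systems by F1 over an intermediate representation of key claims rather than over full reports. The paper's actual matching procedure is not given here; the sketch below assumes exact matching over normalized claim strings, and the names `normalize` and `claim_f1` are illustrative, not taken from the paper.

```python
# Minimal sketch of set-based F1 over extracted claims, following the
# abstract's framing of scoring the key claims a DR system uncovers.
# Assumption: claims match exactly after simple normalization; the
# benchmark's real matcher may be more lenient (e.g., semantic matching).

def normalize(claim: str) -> str:
    """Lowercase and collapse whitespace so trivially different
    surface forms of the same claim compare equal."""
    return " ".join(claim.lower().split())

def claim_f1(predicted: list[str], gold: list[str]) -> float:
    """F1 between the set of claims a system uncovered and the
    reference set of key claims for a task."""
    pred = {normalize(c) for c in predicted}
    ref = {normalize(c) for c in gold}
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)               # claims found and correct
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: recovering one of two reference claims plus one spurious
# claim gives precision 0.5 and recall 0.5, hence F1 = 0.5.
print(claim_f1(["claim a", "claim c"], ["Claim  A", "claim b"]))  # 0.5
```
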
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 18936