DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, Readers: Everyone, License: CC BY 4.0
TL;DR: DEER is an expert-grounded benchmark for deep-research report generation: 50 tasks across 13 domains, a 7-dimension/25-subdimension taxonomy with 101 rubrics and expert guidance, plus document-level fact checking of all claims.
Abstract: Recent advances in large language models have enabled deep research systems that generate expert-level reports through multi-step reasoning and evidence-based synthesis. However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and which criteria to use; LLM-based judges may miss errors that require domain expertise to identify; and because deep research relies on retrieved evidence, report-wide claim verification is also necessary. To address these issues, we propose DEER, a benchmark for evaluating expert-level deep research reports. DEER systematizes evaluation criteria with an expert-developed taxonomy (7 dimensions, 25 subdimensions) operationalized as 101 fine-grained rubric items. We also provide task-specific Expert Evaluation Guidance to support LLM-based judging. In addition to rubric-based assessment, we propose a claim verification architecture that verifies both cited and uncited claims and quantifies evidence quality. Experiments show that current systems produce structurally plausible, evidence-citing reports, but still struggle to fully satisfy expert-level user requests and achieve logical completeness. Beyond performance comparisons, DEER makes system strengths and limitations interpretable and provides diagnostic signals for improvement.
Lay Summary: Deep research agents can now search for information, reason over multiple sources, and write long expert-style reports. However, it is still difficult to know whether these reports are actually good: they may look well structured while missing the user’s intent, using weak reasoning, or making claims that are not properly supported by evidence. We introduce DEER, a benchmark for evaluating expert-level reports generated by deep research agents. DEER defines detailed evaluation criteria covering both report quality and factual reliability, including 101 fine-grained rubric items developed from expert report-writing standards. It also provides task-specific expert guidance so that automated judges can evaluate reports more consistently. In addition, DEER checks claims across the whole report, including claims with and without explicit citations, to measure how well the report is supported by external evidence. Experiments show that current deep research systems can produce plausible reports with citations, but still struggle with fully satisfying expert-level requests and maintaining logical completeness. DEER helps reveal these strengths and weaknesses in a more interpretable way, providing useful diagnostic feedback for improving future deep research systems.
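Illustrative sketch (not taken from the DEER codebase): the summaries above describe two kinds of signal, rubric items organized under dimensions and subdimensions, and report-wide verification of claims with and without citations. The Python below shows one way such signals could be aggregated. The class names, the [0, 1] score range, the unweighted averaging, and the boolean `supported` flag are all assumptions made for illustration; the actual DEER scoring and verification procedures may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    dimension: str      # hypothetical dimension label
    subdimension: str   # hypothetical subdimension label
    score: float        # judge score for one rubric item, assumed to lie in [0, 1]


@dataclass(frozen=True)
class Claim:
    text: str
    cited: bool         # whether the report attaches a citation to this claim
    supported: bool     # whether a verifier found supporting evidence (assumed binary)


def dimension_scores(items):
    """Average rubric items within each subdimension, then subdimensions within each dimension."""
    by_sub = defaultdict(list)
    for item in items:
        by_sub[(item.dimension, item.subdimension)].append(item.score)
    by_dim = defaultdict(list)
    for (dim, _sub), scores in by_sub.items():
        by_dim[dim].append(mean(scores))
    return {dim: mean(sub_means) for dim, sub_means in by_dim.items()}


def support_rates(claims):
    """Fraction of supported claims, reported separately for cited and uncited claims."""
    rates = {}
    for label, flag in (("cited", True), ("uncited", False)):
        group = [c.supported for c in claims if c.cited is flag]
        # None marks an empty group, so a missing category is not read as a zero rate.
        rates[label] = sum(group) / len(group) if group else None
    return rates


if __name__ == "__main__":
    # Placeholder labels and values, not DEER data.
    items = [
        RubricScore("dim_A", "sub_A1", 1.0),
        RubricScore("dim_A", "sub_A1", 0.0),
        RubricScore("dim_A", "sub_A2", 1.0),
        RubricScore("dim_B", "sub_B1", 0.5),
    ]
    claims = [
        Claim("claim with a citation, supported", cited=True, supported=True),
        Claim("claim with a citation, unsupported", cited=True, supported=False),
        Claim("claim without a citation, supported", cited=False, supported=True),
    ]
    print(dimension_scores(items))
    print(support_rates(claims))
```

Averaging subdimensions before dimensions keeps a subdimension with many rubric items from dominating its dimension, and reporting cited and uncited claims separately keeps unsupported uncited claims from being hidden by a high citation accuracy. Both are design choices of this sketch, not statements about how DEER weights its rubrics.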
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/hanjanghoon/DEER.git
Primary Area: General Machine Learning->Evaluation
Keywords: Benchmark, Deep Research, Agents
Originally Submitted PDF: pdf
Submission Number: 14454