Keywords: LLM, Agent, Benchmark, Deep Research
TL;DR: We formalize the workflow of deep research agents and propose Dr.Mi-Bench, a modular-integrated deep research benchmark for scientific deep research agents.
Abstract: The explosive growth of academic literature drives the need for automated deep research (DR) agents, yet evaluating these systems remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, they favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce **Dr.Mi-Bench**, a **M**odular-**i**ntegrated **Bench**mark for scientific **DR** agents. Grounded in academic literature, our benchmark comprises a human-annotated dataset of **200 instances** across **10 scientific domains**, covering both research and review papers. Furthermore, we propose a **M**odular-**i**ntegrated **Eval**uation Paradigm for **DR** Agents (**Dr.Mi-Eval**), which leverages the rich structure of academic papers to assess the core capabilities of planning, retrieval, and reasoning. It employs two complementary modes: an **end-to-end** evaluation for DR agents and an **isolated** evaluation for foundational LLMs as potential backbones. Experimental results reveal an uneven performance landscape: while agents show specialized strengths, they share critical limitations, particularly in multi-source retrieval for review tasks and in maintaining consistency across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor in unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 1820