Keywords: LLM, Agent, Benchmark, Deep Research
TL;DR: We formalize the workflow of deep research agents and propose Dr.Mi-Bench, a modular-integrated deep research benchmark for scientific deep research agents.
Abstract: The explosive growth of academic literature drives the need for automated deep research (DR) agents, yet evaluating these systems remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, they favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce **Dr.Mi-Bench**, a **M**odular-**i**ntegrated **Bench**mark for scientific **DR** agents. Grounded in academic literature, our benchmark comprises a human-annotated dataset of **200 instances** across **10 scientific domains**, covering both research and review papers. Furthermore, we propose a **M**odular-**i**ntegrated **Eval**uation Paradigm for **DR** Agents (**Dr.Mi-Eval**), which leverages the rich structure of academic papers to assess the core capabilities of planning, retrieval, and reasoning. It employs two complementary modes: an **end-to-end** evaluation for DR agents and an **isolated** evaluation for foundational LLMs as potential backbones. Experimental results reveal an uneven performance landscape: while agents show specialized strengths, they share critical limitations, particularly in multi-source retrieval for review tasks and in maintaining consistency across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor in unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 1820