ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of AI Research

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: AI Evaluation, Deep Research, AI for Science, ResearcherBench
TL;DR: This paper introduces ResearcherBench, the first benchmark designed to evaluate whether Deep AI Research Systems can provide meaningful insights into genuinely unsolved, frontier AI research questions.
Abstract: The emergence of deep research systems brings significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems on web retrieval and report generation, overlooking their potential for discovering, integrating, and generating insights in AI research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced agentic systems, which we refer to as Deep AI Research Systems (DARS), on frontier AI research questions. We curated a dataset of 65 research questions expertly selected from real-world AI research scenarios such as laboratory discussions and interviews, spanning 35 AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. The results reveal the strengths and limitations of these systems: they perform notably better on open-ended consulting questions than on technical implementation tasks. Such capabilities demonstrate the potential for DARS to serve as genuine AI research partners, representing a meaningful step toward AI self-improvement. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective on AI research evaluation for scientific collaboration.
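To make the factual assessment concrete, the sketch below shows one way the two metrics named in the abstract could be computed over claims extracted from a generated report: faithfulness as the fraction of cited claims actually supported by their sources, and groundedness as the fraction of all claims that carry at least one citation. The claim-level decomposition, data structures, and names here are illustrative assumptions, not the benchmark's released implementation.

```python
# Illustrative sketch (not the paper's released code): claim-level
# faithfulness and groundedness for a generated research report.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    text: str
    citations: List[str] = field(default_factory=list)  # cited source IDs/URLs
    supported: bool = False  # judged: do the cited sources support the claim?

def factual_assessment(claims: List[Claim]) -> dict:
    """Compute the two factual metrics described in the abstract.

    faithfulness -- among claims with citations, the fraction whose cited
                    sources actually support them (citation accuracy).
    groundedness -- among all claims, the fraction backed by at least one
                    citation (coverage).
    """
    cited = [c for c in claims if c.citations]
    faithfulness = (sum(c.supported for c in cited) / len(cited)) if cited else 0.0
    groundedness = (len(cited) / len(claims)) if claims else 0.0
    return {"faithfulness": faithfulness, "groundedness": groundedness}

# Hypothetical claims extracted from a DARS report:
report_claims = [
    Claim("Method X improves accuracy by 3%", ["arxiv:2401.00001"], supported=True),
    Claim("Benchmark Y contains 65 questions", ["https://example.org/bench"], supported=False),
    Claim("Transformers rely on attention"),  # no citation -> hurts groundedness
]
print(factual_assessment(report_claims))
# {'faithfulness': 0.5, 'groundedness': 0.666...}
```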
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 24669