Keywords: generative research synthesis, deep research, live benchmark
Abstract: The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these capabilities through generative research synthesis, performing retrieval over the live web and synthesizing many discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework for generative research synthesis. DeepScholar-bench draws queries from recent, high-quality arXiv papers and focuses on a real research synthesis task: generating the related-work section of a paper by retrieving, synthesizing, and citing prior research. Our automated evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability, using metrics that show strong agreement with expert human judgments. We also develop DeepScholar-base, a reference pipeline for generative research synthesis, implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we systematically evaluate prior open-source systems, search AIs with open-source and strong proprietary models, OpenAI's Deep Research, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining competitive or higher performance than prior open-source systems, search AIs, and OpenAI's Deep Research. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics.
These results underscore both the difficulty and the importance of DeepScholar-bench as a foundation for progress toward AI systems capable of generative research synthesis.
Submission Number: 135