BrowseComp-Plus: A Fair and Disentangled Evaluation Benchmark for Deep Search Agents

ACL ARR 2026 January Submission3962 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Information Retrieval, Search Agents, Benchmark
Abstract: Deep search agents that combine large language models with retrieval tools excel at complex, multi-hop queries. Yet existing benchmarks such as BrowseComp rely on black-box web search APIs, which face two key limitations. (1) Fairness: dynamic and opaque web APIs hinder reproducibility and fair comparisons across agents. (2) Disentanglement: without a fixed document corpus, retriever contributions cannot be isolated from end-to-end search agent accuracy. We introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, human-verified corpus, enabling controlled retrieval for deep search agents. BrowseComp-Plus clearly distinguishes agent performance: with a BM25 retriever, the open-source Search-R1 achieves 3.86% accuracy, while GPT-5 achieves 55.9%. It also makes retrieval gains explicit: pairing GPT-5 with a Qwen3-Embedding-8B retriever further improves accuracy to 70.1% while reducing the number of search calls. Overall, BrowseComp-Plus provides a fair and disentangled testbed, advancing both deep search agent evaluation and retrieval research for agentic search. Code and data will be released.
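The abstract's "controlled retrieval" setup scores a query against a fixed corpus with BM25. The paper does not specify its retriever implementation; below is a minimal, self-contained sketch of standard BM25 (Okapi variant) over a tokenized corpus, with all names and the toy documents being illustrative assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in a fixed, pre-tokenized corpus against a query.

    Standard BM25 with the usual k1/b defaults: term-frequency saturation
    controlled by k1, document-length normalization controlled by b.
    """
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy fixed corpus (illustrative only, not benchmark data).
corpus = [
    ["deep", "search", "agents", "with", "retrieval", "tools"],
    ["cooking", "recipes", "for", "beginners"],
    ["benchmark", "corpus", "for", "search"],
]
query = ["search", "agents"]
ranking = sorted(range(len(corpus)),
                 key=lambda i: bm25_scores(query, corpus)[i],
                 reverse=True)
```

Because the corpus is fixed, two agents issuing the same query always see the same ranked documents, which is exactly the reproducibility property the benchmark is designed to provide.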
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Information Retrieval and Text Mining, Resources and Evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 3962