Keywords: Information Retrieval, Search Agents, Benchmark
Abstract: Deep search agents that combine large language models with retrieval tools excel at complex, multi-hop queries.
Yet existing benchmarks such as BrowseComp rely on black-box web search APIs, which suffer from two key limitations. (1) Fairness: dynamic and opaque web APIs hinder reproducibility and fair comparison across agents. (2) Disentanglement: without a fixed document corpus, it is impossible to isolate the retriever's contribution from end-to-end search agent accuracy.
We introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, human-verified corpus, enabling controlled retrieval for deep search agents.
BrowseComp-Plus clearly distinguishes agent capabilities: with a BM25 retriever, the open-source Search-R1 achieves 3.86% accuracy, while GPT-5 achieves 55.9%.
Additionally, BrowseComp-Plus makes retrieval gains explicit: pairing GPT-5 with the Qwen3-Embedding-8B retriever further improves accuracy to 70.1% while reducing the number of search calls.
Overall, BrowseComp-Plus provides a fair and disentangled testbed, advancing both deep search agent evaluation and retrieval research for agentic search. Code and data will be released.
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Information Retrieval and Text Mining, Resources and Evaluation
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 3962