Submission Track: Paper Track (up to 8 pages)
Keywords: Agentic Search, Web Agents, Information Retrieval
Abstract: Agentic search systems such as Deep Research, in which large language models autonomously browse the web, synthesize information, and return citation-backed answers, represent a major shift in how users interact with web-scale information.
While agentic search promises greater efficiency and cognitive offloading, its growing complexity and open-endedness have outpaced existing evaluation benchmarks and methodologies, which largely assume short horizons and static answers. In this paper, we introduce AgentSearchBench, a benchmark of 100 realistic, high-quality, long-horizon tasks that require real-time web interaction and extensive information synthesis. To address the challenge of evaluating time-varying, multi-source answers, we propose a novel Agent-as-a-Judge framework. Our method leverages task-specific, tree-structured rubrics and rubric-based judge agents to automatically assess both factual correctness and source attribution, achieving high agreement with human judgments. We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis that yields insights for future development.
Together, AgentSearchBench and our evaluation framework provide a rigorous foundation for developing and benchmarking the next generation of trustworthy, high-capability agentic search systems.
Submission Number: 32