WebWalker: Benchmarking LLMs in Web Traversal

Published: 05 Mar 2025, Last Modified: 11 Mar 2025 · Reasoning and Planning for LLMs @ ICLR 2025 · CC BY 4.0
Keywords: Web Agent, Benchmarking, RAG, Multi-Agent
Abstract: Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal, evaluating how systematically they can traverse a website's subpages to extract high-quality data. We also propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker through horizontal and vertical integration in real-world scenarios.
Submission Number: 92
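
As an illustration of the explore-critic paradigm described in the abstract, here is a minimal Python sketch that pairs an explorer agent (which picks the next subpage to visit) with a critic agent (which decides when enough evidence has been gathered to answer a query). All names are hypothetical and the trivial heuristics stand in for LLM-driven decisions; this is a sketch of the general idea, not the authors' implementation.

```python
# Hypothetical sketch of an explore-critic web-traversal loop.
# Heuristic stand-ins replace the LLM calls used in the actual framework.
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    text: str
    links: list           # URLs of subpages reachable from this page


@dataclass
class Memory:
    evidence: list = field(default_factory=list)  # snippets the critic keeps


def explorer_pick(query, frontier, visited):
    """Explorer agent: choose the most promising unvisited URL."""
    candidates = [u for u in frontier if u not in visited]
    if not candidates:
        return None
    # Prefer URLs sharing a token with the query; otherwise take the first.
    for u in candidates:
        if any(tok in u.lower() for tok in query.lower().split()):
            return u
    return candidates[0]


def critic_update(query, page, memory):
    """Critic agent: keep useful snippets and signal when enough is known."""
    if any(tok in page.text.lower() for tok in query.lower().split()):
        memory.evidence.append(page.text)
    return len(memory.evidence) >= 2   # toy "sufficient information" signal


def web_walk(query, root, fetch, max_steps=10):
    """Traverse subpages from `root` until the critic is satisfied."""
    memory, visited, frontier = Memory(), set(), [root]
    for _ in range(max_steps):
        url = explorer_pick(query, frontier, visited)
        if url is None:
            break
        page = fetch(url)
        visited.add(url)
        frontier.extend(page.links)
        if critic_update(query, page, memory):
            break
    return memory.evidence


if __name__ == "__main__":
    # Toy in-memory "website" so the sketch runs without network access.
    site = {
        "/": Page("/", "University homepage", ["/admissions", "/research"]),
        "/admissions": Page("/admissions", "Admissions deadline is March 1.", []),
        "/research": Page("/research", "Research areas include NLP.", []),
    }
    print(web_walk("admissions deadline research areas", "/", site.__getitem__))
```

In the framework the abstract describes, the heuristic choices here would presumably be made by LLM agents, and `fetch` would retrieve and parse live web pages rather than an in-memory dictionary.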
