Submission Track: Paper Track (up to 8 pages)
Keywords: vision language models, visual question answering, multimodality, cross-modal information extraction
TL;DR: We present a new multimodal benchmark to measure question answering across webpage sequences.
Abstract: The growing power of multimodal large language models (MLLMs) is making autonomous web agents that assist users a reality.
To accurately assess these agents' capabilities in real-world scenarios, we introduce WebQuest, a new benchmark dataset that challenges MLLMs with cross-page question answering requiring complex reasoning, such as arithmetic and sorting, across diverse website categories. Unlike existing web agent benchmarks that focus on multi-step web navigation and task completion, WebQuest evaluates information extraction, multimodal retrieval, and composition of information from many web pages at once. We provide three dataset splits: Single Screen QA, Multi Screen QA, and Trace QA, the last of which is based on navigation traces. We evaluate leading proprietary multimodal models such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, as well as open-source models such as InternVL2.5, Pixtral, and Qwen2.5-VL, on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. We also explore techniques such as chain-of-thought prompting to address this gap.
Submission Number: 33