Submission Track: Paper Track (up to 8 pages)
Keywords: vision language models, visual question answering, multimodality, cross-modal information extraction
TL;DR: We present a new multimodal benchmark to measure question answering across webpage sequences.
Abstract: The growing power of multimodal large language models (MLLMs) is making autonomous web agents that assist users a reality.
To accurately assess these agents' capabilities in real-world scenarios, we introduce WebQuest, a new benchmark dataset that challenges MLLMs with cross-page question answering requiring complex reasoning, such as arithmetic and sorting, across diverse website categories. Unlike existing web agent benchmarks that focus on multi-step web navigation and task completion, WebQuest evaluates information extraction, multimodal retrieval, and composition of information from many web pages at once. We provide three dataset splits: Single Screen QA, Multi Screen QA, and Trace QA, the last of which is based on navigation traces. We evaluate leading proprietary multimodal models such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, as well as open-source models such as InternVL2.5, Pixtral, and Qwen2.5-VL, on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. We also explore techniques such as chain-of-thought prompting to address this gap.
Submission Number: 33