Keywords: agents, vision-language models, offline evaluation, visual question answering (VQA), benchmark
TL;DR: AgentVQA is a unified benchmark designed to test how well VLMs' capabilities generalize across a wide variety of agentic domains.
Abstract: Vision-language models (VLMs) can perform a broad range of tasks across diverse settings, yet their performance in agentic contexts remains poorly understood. Existing benchmarks are domain-specific, making comprehensive evaluation difficult, and they often require computationally expensive online simulators. To address this gap, we introduce AgentVQA, a benchmark for systematically evaluating agentic capabilities in VLMs. AgentVQA offers three key advantages: (1) $\textit{Comprehensive}$ – it consists of 14 datasets spanning five critical agentic domains: Web Agents, Robotics, Egocentric Videos, Games, and Spatial Understanding. (2) $\textit{Standardized}$ – we reformulate diverse tasks, such as trajectory-based web navigation and gameplay, into a unified multiple-choice question (MCQ) format, and we balance the sample distribution across domains, data formats, and semantic categories. (3) $\textit{Challenging}$ – our data processing pipeline generates hard negative options for the MCQs, which are then manually reviewed for correctness. Among all the models we evaluate, the best achieves only $\sim$60\% accuracy. Furthermore, our ablation studies highlight key error modes where current VLMs can be improved.
Primary Area: datasets and benchmarks
Submission Number: 14617