Keywords: agents, vision-language models, offline evaluation, visual question answering (VQA), benchmark
TL;DR: AgentVQA is a unified benchmark designed to test how well VLMs' capabilities generalize across a wide variety of agentic domains.
Abstract: Vision-language models (VLMs) can perform a broad range of tasks across diverse settings, yet their performance in agentic contexts remains poorly understood. Existing benchmarks are domain-specific, making comprehensive evaluation difficult, and they often require computationally expensive online simulators. To address this gap, we introduce AgentVQA, a benchmark for systematically evaluating agentic capabilities in VLMs. AgentVQA offers three key advantages: (1) $\textit{Comprehensive}$ – it consists of 14 datasets spanning five critical agentic domains: Web Agents, Robotics, Egocentric Videos, Games, and Spatial Understanding. (2) $\textit{Standardized}$ – we reformulate diverse tasks, such as trajectory-based web navigation and gameplay, into a unified multiple-choice question (MCQ) format, and we balance the sample distribution across domains, data formats, and semantic categories. (3) $\textit{Challenging}$ – our data processing pipeline generates hard negative options for the MCQs, which are then manually reviewed for correctness. Among all the models we evaluate, the best achieves only $\sim$60\% accuracy. Furthermore, our ablation studies highlight key error modes where current VLMs can be improved.
Primary Area: datasets and benchmarks
Submission Number: 14617