VizAgentBench: Benchmarking Multimodal Agent Reasoning on Coordinated Multi-View Visual Analytics Tasks
Keywords: visual question answering, multimodality
TL;DR: VizAgentBench is a new visual question answering benchmark for LLM agents that evaluates their ability to perceive interactive visualizations, interact with them, and answer complex analytical questions.
Abstract: Multimodal Large Language Models (MLLMs) can now act as full-fledged desktop agents, yet their visual reasoning skills are still evaluated largely on single, static charts. Real decision making, however, happens in dashboards that combine multiple coordinated views (MCVs) and rely on rich interactions such as brushing, filtering, and drilling down. We introduce VizAgentBench, the first benchmark that challenges agents to perceive screenshots of a live MCV dashboard, issue declarative interaction commands, and answer analytical questions whose solutions may be hidden behind dynamic tooltips or axis changes. VizAgentBench is constructed by (1) surveying 14 visualization research papers to derive a design space of chart-and-interaction templates; (2) mining 10 public Kaggle datasets spanning finance, healthcare, sports, and socio-economics; and (3) generating 192 dashboards, each paired with a question–answer task, using a large language model followed by manual validation by graduate students in data science. On our benchmark, state-of-the-art LLM agents achieve only 40% accuracy, revealing substantial headroom. We release the dashboards, data, and an open-source API that separates perception from action, lowering the barrier to agent research on interactive visualization.
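A minimal sketch of what an agent loop against a perception/action API of the kind the abstract describes might look like; every name here (VizAgentEnv, perceive, act, decide) is a hypothetical illustration, not the released interface:

class VizAgentEnv:
    """Hypothetical wrapper around a live MCV dashboard; keeps perception and action separate."""

    def perceive(self) -> bytes:
        """Return a screenshot (e.g. PNG bytes) of the current dashboard state."""
        raise NotImplementedError

    def act(self, command: dict) -> None:
        """Apply a declarative interaction command, e.g. a brush or a filter."""
        raise NotImplementedError

def answer_task(env: VizAgentEnv, agent, question: str, max_steps: int = 10) -> str:
    """Iteratively perceive, interact, and finally answer one QA task."""
    for _ in range(max_steps):
        screenshot = env.perceive()
        step = agent.decide(screenshot, question)  # MLLM call; interface assumed
        if step["type"] == "answer":
            return step["text"]
        # e.g. {"type": "interact",
        #       "command": {"view": "scatter", "action": "brush", "x": [0, 5]}}
        env.act(step["command"])
    # step budget exhausted: force a final answer from the last view
    return agent.decide(env.perceive(), question).get("text", "")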
Primary Area: datasets and benchmarks
Submission Number: 23509