Keywords: agents, benchmark, video understanding, multimodal agents
Abstract: Videos are often used to learn or extract the information needed to complete
tasks in ways that text or static imagery cannot. However, many
existing agent benchmarks neglect long-context video understanding, instead focus-
ing on text or static image inputs. To bridge this gap, we introduce VideoWebArena
(VideoWA), a benchmark for evaluating the capabilities of long-context multimodal
agents for video understanding. VideoWA consists of 2,021 web agent tasks based
on manually crafted video tutorials, which total almost four hours of content. For
our benchmark, we define a taxonomy of long-context video-based agent tasks with
two main areas of focus: skill retention and factual retention. While skill retention
tasks evaluate whether an agent can use a given human demonstration to complete
a task efficiently, factual retention tasks evaluate whether an agent can retrieve
instruction-relevant information from a video to complete a task. We find that the
best model achieves a 13.3% success rate on factual retention tasks and 45.8% on
factual retention QA pairs—far below human success rates of 73.9% and 79.3%,
respectively. On skill retention tasks, long-context models perform worse with
tutorials than without, exhibiting a 5% performance decrease in WebArena tasks
and a 10.3% decrease in VisualWebArena tasks. Our work highlights performance
gaps in the agentic abilities of long-context multimodal models and provides a
testbed for the future development of long-context video agents.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9324