Keywords: agents, video understanding, long context models, web agents
Abstract: Videos are often used to learn or extract the information necessary to complete tasks in ways that text and static imagery alone cannot provide. However,
many existing agent benchmarks neglect long-context video understanding, instead
focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context
multimodal agents for video understanding. VideoWA consists of 2,021 web agent
tasks based on manually crafted video tutorials, which total almost four hours of
content. For our benchmark, we define a taxonomy of long-context video-based
agent tasks with two main areas of focus: skill retention and factual retention.
While skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, factual retention tasks evaluate whether
an agent can retrieve instruction-relevant information from a video to complete
a task. We find that the best model achieves 13.3% success on factual retention
a task. We find that the best model achieves 13.3% success on factual retention
tasks and 45.8% on factual retention QA pairs, far below human performance
at 73.9% and 79.3%, respectively. On skill retention tasks, long-context models
perform worse with tutorials than without, exhibiting a 5% performance decrease
in WebArena tasks and a 10.3% decrease in VisualWebArena tasks. Our work
highlights the need to improve the agentic abilities of long-context multimodal
models and provides a testbed for future development with long-context video
agents.
Submission Number: 86