A Multi-agent Reasoning Framework for Video Question Answering

Published: 28 Sept 2025, Last Modified: 19 Oct 2025 · SEA @ NeurIPS 2025 Poster · CC BY 4.0
Keywords: Multimodal Large Language Models, Vision Language Models, Video Question Answering, Agentic AI
TL;DR: Temporal Video Agents (TVA) is a dynamic multi-agent workflow that brings video question answering 10% closer to human performance.
Abstract: We present Temporal Video Agents (TVA), a modular multi-agent framework addressing major perception and reasoning failures in standalone Multimodal Large Language Models (MLLMs) for complex video understanding. Guided by failure analysis on the Minerva benchmark—highlighting issues in temporal localization, spatial reasoning under motion, and text recognition—TVA decomposes video question-answering into structured sub-problems, coordinated by specialized agents such as a Planner and a Temporal Scoper within a dynamic, question-adaptive workflow. Experiments show TVA improves accuracy by 2.6% over a strong Gemini 2.5 Pro baseline, narrowing the gap to human performance by nearly 10%. Notably, smaller models benefit from explicit external tools, while larger models exhibit intrinsic perception skills unlocked via prompting, effectively "hallucinating" tool use. These findings offer a new perspective on designing robust and efficient multimodal systems, suggesting a paradigm shift from universal tool integration towards adaptive, prompt-driven perception.
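The abstract's Planner-driven, question-adaptive decomposition could be sketched roughly as follows. This is a hypothetical illustration only: the agent names (`planner`, `temporal_scoper`, `answerer`) follow the roles named in the abstract, but the routing logic and interfaces are assumptions, not the paper's published implementation.

```python
# Hypothetical sketch of a question-adaptive multi-agent workflow in the
# spirit of TVA. All names and routing rules are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    # Flags a Planner might infer from the question (assumed, for illustration)
    needs_temporal_scoping: bool = False
    needs_text_recognition: bool = False


def planner(question: Question) -> list[str]:
    """Decompose a video question into an ordered list of agent steps."""
    steps: list[str] = []
    if question.needs_temporal_scoping:
        steps.append("temporal_scoper")  # localize the relevant clip first
    if question.needs_text_recognition:
        steps.append("text_reader")      # e.g. OCR over the scoped frames
    steps.append("answerer")             # final MLLM call over scoped evidence
    return steps


def run_tva(question: Question) -> list[str]:
    # The workflow is assembled dynamically per question, rather than
    # running every tool on every input.
    return planner(question)
```

The point of the sketch is the adaptivity: the set of agents invoked depends on the question, which is how such a design could avoid the cost of universal tool integration.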
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 54