Towards Multimodal Understanding, Reasoning, and Tool Usage across Vision, Speech, and Audio in Long Videos

ICLR 2026 Conference Submission 25295 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multimodal, long-form video understanding, benchmark, agentic pipeline, question answering, scenario-driven QA
TL;DR: STARBench is a human-validated benchmark for long-form multimodal video understanding, and STARAgent is an agentic pipeline for the same task; together they expose the limits of current state-of-the-art MLLMs.
Abstract: Long-form, multimodal video understanding requires models to integrate vision, speech, and ambient audio while reasoning coherently over extended contexts. However, existing benchmarks typically emphasize either long temporal contexts or rich multimodal content, rarely both. Moreover, they are usually restricted to multiple-choice evaluation and a single accuracy metric, offering limited insight into where models succeed or fail. To address these gaps, we introduce **STARBench**, a diagnostic benchmark for long-form, multimodal video understanding. STARBench features open-ended, intent-driven questions that reflect how humans naturally engage with video content. It supports single- and multi-turn dialogues, encompassing multimodal reasoning and agentic tool-use tasks across rich video, audio, and speech contexts. Each question includes a reference answer and a rubric with graded criteria, enabling interpretable and traceable evaluation. Importantly, STARBench is generated via a scalable, human-validated pipeline, ensuring reproducibility and coverage. Complementing the benchmark, we propose **STARAgent**, an agentic system that analyzes long videos using pre-processing, search, and refinement tools. Evaluating state-of-the-art closed- and open-source MLLMs on STARBench reveals substantial limitations: the top-performing Gemini-2.5-Flash reaches only 52.95%, while open-source models remain below 25%. STARAgent, leveraging structured reasoning over long videos, achieves 44.66%, underscoring the challenge of complex, real-world video understanding. By combining breadth, interpretability, and reproducibility, STARBench provides a practical foundation for benchmarking and improving MLLMs on long-form, multimodal video tasks. All code, including the agentic pipeline, and all datasets will be released publicly.
Primary Area: datasets and benchmarks
Submission Number: 25295