ILVR-Agent: Decomposing Instance-Level Reasoning in Long-Form Videos via Chain-of-Agent Thinking

ACL ARR 2026 January Submission 1398 Authors

29 Dec 2025 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Chain-of-Agent Thinking; Instance-Level Long-Form Video Reasoning

Abstract: Recently, agentic video reasoning methods have shown significant potential by incentivizing tool-thinking capabilities through Reinforcement Learning (RL). However, existing agentic approaches struggle with Instance-level Long-form Video Reasoning (ILVR), which demands extensive cross-frame evidence aggregation, because the reasoning chains and tool-thinking trajectories it requires grow long and complex. To address these challenges, we introduce ILVR-Agent, a multi-agent framework powered by Chain-of-Agent Thinking (CoAT), which modularizes complex reasoning chains and delegates tool-thinking to specialized agents. We develop ILVR-Agent from three perspectives: dataset, method, and benchmark. First, we design an end-to-end multi-agent engine to curate ILVR-Instruction, a large-scale, high-quality instruction dataset tailored for ILVR. Second, the ILVR-Agent method orchestrates a collaborative reasoning pipeline by decomposing intricate reasoning chains into three stages (retrieval, planning, and execution) and then invoking specialized agents with task-specific tool-thinking. To improve tool-thinking efficiency, we further propose PA-GRPO, an RL framework that incorporates process-aware supervision via LLM-as-Judge, explicitly validating each tool invocation throughout the reasoning trajectory. Finally, we establish ILVR-Bench, a comprehensive benchmark for evaluating the ILVR capabilities of Video-LLMs. Extensive experiments and analyses demonstrate that ILVR-Agent achieves promising performance on both instance-level and general long-form video reasoning.
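The abstract does not give the PA-GRPO reward formula, so the following is only a minimal sketch of how process-aware supervision could feed a GRPO-style update. It assumes an LLM judge returns one boolean verdict per tool invocation, and that the process score is added to the outcome reward with a weight. The function names, the boolean verdict format, and `process_weight` are hypothetical, not taken from the paper. Only the group-relative normalization (reward minus the group mean, divided by the group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev

def trajectory_reward(outcome_correct, tool_call_verdicts, process_weight=0.5):
    """Hypothetical reward: outcome term plus weighted share of valid tool calls.

    tool_call_verdicts holds one boolean per tool invocation, assumed to come
    from an LLM judge; process_weight is an illustrative value.
    """
    outcome = 1.0 if outcome_correct else 0.0
    if tool_call_verdicts:
        process = mean([1.0 if v else 0.0 for v in tool_call_verdicts])
    else:
        process = 0.0  # no tool calls: no process credit in this sketch
    return outcome + process_weight * process

def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization over a group of sampled trajectories."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three sampled trajectories for the same question (made-up verdicts).
rewards = [
    trajectory_reward(True, [True, True]),
    trajectory_reward(True, [True, False]),
    trajectory_reward(False, [False, False]),
]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two trajectories that both answer correctly still receive different advantages when one of them makes an invalid tool call, which is the kind of signal that outcome-only rewards cannot provide.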

Paper Type: Long

Research Area: AI/LLM Agents

Research Area Keywords: multimodality; video processing; reasoning

Languages Studied: English

Submission Number: 1398
