Keywords: Chain-of-Agent Thinking; Instance-Level Long-Form Video Reasoning
Abstract: Recently, agentic video reasoning methods have demonstrated significant potential by incentivizing tool-thinking capabilities through Reinforcement Learning (RL). However, existing agentic approaches struggle with Instance-level Long-form Video Reasoning (ILVR), which demands extensive cross-frame evidence aggregation, because reasoning chains and tool-thinking trajectories grow with video length. To address these challenges, we introduce ILVR-Agent, a multi-agent framework powered by Chain-of-Agent Thinking (CoAT), which modularizes complex reasoning chains and facilitates modular tool-thinking with specialized agents. Specifically, we develop ILVR-Agent from three perspectives: dataset, method, and benchmark. First, we design an end-to-end multi-agent engine to curate ILVR-Instruction, a large-scale, high-quality instruction dataset tailored for ILVR. Second, the ILVR-Agent method orchestrates a collaborative reasoning pipeline by modularizing intricate reasoning chains into retrieval, planning, and execution, and then invoking specialized agents with task-specific tool-thinking. Third, to enhance tool-thinking efficiency, we propose PA-GRPO, an RL framework that incorporates process-aware supervision via LLM-as-Judge, explicitly validating each tool invocation throughout the reasoning trajectory. Finally, we establish ILVR-Bench, a comprehensive benchmark for evaluating the ILVR capabilities of Video-LLMs. Extensive experiments and analyses demonstrate that ILVR-Agent achieves promising performance on both instance-level and general long-form video reasoning.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: multimodality; video processing; reasoning
Languages Studied: English
Submission Number: 1398