Keywords: Multi-modal Agent, Video Understanding, Video Temporal Grounding
TL;DR: An agentic solution for long video understanding and video temporal grounding through LoRA switching.
Abstract: Videos, with their unique temporal dimension, demand precise, grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in text-based reasoning with large language models, multi-modal reasoning, especially for videos, remains limited. In this work, we fill this gap by introducing VideoMind, a novel video-language agent for temporal-grounded video reasoning. Our method involves two key innovations: (1) We identify four essential capabilities for grounded video reasoning and propose a role-based agentic workflow, comprising a planner to coordinate roles, a grounder for temporal event localization, a verifier to assess event candidates, and an answerer for question answering. (2) To efficiently integrate these roles during inference, we propose a novel Chain-of-LoRA mechanism, where a unified base model with multiple LoRA adapters is leveraged to enable seamless role switching, balancing efficiency and flexibility. Extensive experiments on 14 benchmarks across 3 tasks, including Grounded VideoQA, Video Temporal Grounding, and General VideoQA, demonstrate the effectiveness of the proposed scheme in advancing video agents, test-time scaling, and long-form video reasoning.
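The Chain-of-LoRA idea of swapping lightweight role adapters over a single shared base model can be illustrated with a minimal sketch using the Hugging Face PEFT API; the model identifier, adapter paths, and prompts below are hypothetical placeholders for illustration, not the paper's released code.

```python
# Minimal sketch of Chain-of-LoRA role switching with Hugging Face PEFT.
# "base-video-llm" and the adapter paths/names are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-video-llm")
tokenizer = AutoTokenizer.from_pretrained("base-video-llm")

# Attach one LoRA adapter per role to the same frozen base model.
model = PeftModel.from_pretrained(base, "adapters/planner", adapter_name="planner")
for role in ("grounder", "verifier", "answerer"):
    model.load_adapter(f"adapters/{role}", adapter_name=role)

def run_role(role: str, prompt: str) -> str:
    """Activate the adapter for the given role, then generate with shared base weights."""
    model.set_adapter(role)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# One grounded-QA pass chains the roles: plan -> ground -> verify -> answer.
plan = run_role("planner", "Question: when does the dog jump? Decide which roles to call.")
span = run_role("grounder", "Localize the event 'the dog jumps' in the video.")
check = run_role("verifier", f"Verify that the candidate segment {span} shows the queried event.")
answer = run_role("answerer", f"Answer the question using the verified segment {span}.")
```

Because only the small adapter weights change between roles, switching is cheap at inference time compared with hosting four separate full models.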
Submission Number: 22