VTimeLLM: Empower LLM to Grasp Video Moments

Published: 01 Jan 2024, Last Modified: 13 May 2025 · CVPR 2024 · CC BY-SA 4.0
Abstract: Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse description of the entire video, failing to capture the precise start and end time boundaries of specific events. In this paper, we address this issue by proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents. Extensive experiments demonstrate that in fine-grained time-related comprehension tasks for videos, such as Temporal Video Grounding and Dense Video Captioning, VTimeLLM significantly outperforms existing Video LLMs. Moreover, its fine-grained temporal understanding of videos further enables VTimeLLM to beat existing Video LLMs on a video dialogue benchmark, showing its superior cross-modal understanding and reasoning abilities. Our project page is at https://github.com/huangb23/VTimeLLM