Keywords: Vision Language Models, VideoLLMs, Video Diffusion Models, Representation Alignment
TL;DR: Aligning the video encoders of VideoLLMs with a pretrained video diffusion model enhances their fine-grained temporal understanding
Abstract: Video Question Answering (VideoQA) has traditionally revolved around tasks solvable by recognizing objects or simple events. However, the frontier of the field is increasingly pushing towards challenges that require reasoning about fine-grained motion and subtle temporal dynamics. This shift exposes a critical limitation of contemporary VideoLLMs, which often struggle to perceive these intricate dynamics. To address this, we introduce Video Diffusion Alignment (VDA), a framework that leverages the inherent ability of pretrained video diffusion models to represent intricate motion dynamics, thereby enhancing motion representation learning. Our method steers a VideoLLM to focus on complex motion patterns by distilling motion-centric knowledge from the diffusion model, yielding more robust and detailed temporal features. Through extensive experiments, we show that VDA maintains competitive performance on traditional VideoQA benchmarks such as MSVD-QA and MSRVTT-QA, while boosting scores on MotionBench, a benchmark specifically designed for fine-grained motion understanding. This result holds across three VideoLLMs with distinct architectures, confirming the generality of our approach.
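The abstract does not specify the alignment objective, but a representation-alignment loss of the kind described (matching a VideoLLM's video-encoder features to those of a frozen diffusion model) is often a cosine-similarity distillation term. The sketch below is a minimal, hypothetical illustration only: the function name `vda_alignment_loss`, the use of a learned projection, and the cosine-distance form are assumptions, not details confirmed by the paper.

```python
import numpy as np

def vda_alignment_loss(llm_feats: np.ndarray, diff_feats: np.ndarray) -> float:
    """Hypothetical alignment term: mean (1 - cosine similarity) between
    per-token VideoLLM features and frozen diffusion-model features.

    llm_feats, diff_feats: (num_tokens, dim) arrays. In practice llm_feats
    would pass through a learned projection to match diff_feats' dimension;
    here both are assumed to share a dimension for simplicity.
    """
    a = llm_feats / np.linalg.norm(llm_feats, axis=-1, keepdims=True)
    b = diff_feats / np.linalg.norm(diff_feats, axis=-1, keepdims=True)
    cos = np.sum(a * b, axis=-1)       # cosine similarity per token
    return float(np.mean(1.0 - cos))   # 0 when features are perfectly aligned

# Identical features incur zero loss; orthogonal features incur loss 1.
x = np.random.rand(8, 16)
print(round(vda_alignment_loss(x, x), 6))  # → 0.0
```

In a full training loop, a term like this would be added to the VideoLLM's standard language-modeling loss with a weighting coefficient, with the diffusion model kept frozen as the distillation target.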
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8337