Keywords: Vision Language Models, VideoLLMs, Video Diffusion Models, Representation Alignment
TL;DR: Aligning the video encoders of VideoLLMs with a pretrained video diffusion model enhances their fine-grained temporal understanding
Abstract: Video Question Answering (VideoQA) has traditionally revolved around tasks solvable by recognizing objects or simple events. However, the frontier of the field is increasingly pushing towards challenges that require reasoning about fine-grained motion and subtle temporal dynamics. This shift exposes a critical limitation of contemporary VideoLLMs, which often struggle to perceive these intricate dynamics. To address this, we introduce Video Diffusion Alignment (VDA), a framework that leverages the inherent ability of pretrained video diffusion models to represent intricate motion dynamics, thereby enhancing motion representation learning. Our method steers a VideoLLM to focus on complex motion patterns by distilling motion-centric knowledge from the diffusion model, yielding more robust and detailed temporal features. Through extensive experiments, we show that VDA maintains competitive performance on traditional VideoQA benchmarks such as MSVD-QA and MSRVTT-QA, while boosting scores on MotionBench, a benchmark specifically designed for fine-grained motion understanding. This result holds across three VideoLLMs with distinct architectures, confirming the generality of our approach.
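The abstract does not specify the alignment objective, but a representation-alignment loss of the kind described (matching a VideoLLM's video-encoder features to those of a frozen diffusion model) is often a cosine-similarity distillation term. The sketch below is a minimal, hypothetical illustration only: the function name `vda_alignment_loss`, the use of a learned projection, and the cosine-distance form are assumptions, not details confirmed by the paper.

```python
import numpy as np

def vda_alignment_loss(llm_feats: np.ndarray, diff_feats: np.ndarray) -> float:
    """Hypothetical alignment term: mean (1 - cosine similarity) between
    per-token VideoLLM features and frozen diffusion-model features.

    llm_feats, diff_feats: (num_tokens, dim) arrays. In practice llm_feats
    would pass through a learned projection to match diff_feats' dimension;
    here both are assumed to share a dimension for simplicity.
    """
    a = llm_feats / np.linalg.norm(llm_feats, axis=-1, keepdims=True)
    b = diff_feats / np.linalg.norm(diff_feats, axis=-1, keepdims=True)
    cos = np.sum(a * b, axis=-1)       # cosine similarity per token
    return float(np.mean(1.0 - cos))   # 0 when features are perfectly aligned

# Identical features incur zero loss; orthogonal features incur loss 1.
x = np.random.rand(8, 16)
print(round(vda_alignment_loss(x, x), 6))  # → 0.0
```

In a full training loop, a term like this would be added to the VideoLLM's standard language-modeling loss with a weighting coefficient, with the diffusion model kept frozen as the distillation target.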
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8337