VideoSAVi: Self-Aligned Video Language Models without Human Supervision

Published: 08 Jul 2025 · Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: Video understanding, Self-alignment, Video-language models, Direct preference optimization, Self-critiquing
TL;DR: VideoSAVi introduces a self-aligning approach that enables video-language models to generate high-quality preference pairs from their own outputs, achieving state-of-the-art performance without external supervision.
Abstract: Recent advances in video large language models (Video-LLMs) have led to significant progress in video understanding. Current preference optimization methods often rely on proprietary APIs or ground-truth captions to generate preference data (i.e., pairs of model outputs ranked by their quality or alignment with human judgment), which is then used to train models for video-language alignment. This approach is both costly and labor-intensive. To address this limitation, we introduce $\textbf{VideoSAVi}$ ($\underline{\textbf{S}}$elf-$\underline{\textbf{A}}$ligned $\underline{\textbf{Vi}}$deo Language Model), a self-training pipeline that enables Video-LLMs to reason over video content without external supervision. Our approach includes a self-critiquing mechanism that identifies reasoning errors in the model's initial responses and generates improved alternatives, creating preference pairs directly from video content. VideoSAVi then applies Direct Preference Optimization (DPO) to iteratively train the model on this preference data, enhancing its temporal and spatial reasoning for video understanding. Experiments show that VideoSAVi delivers significant improvements across multiple benchmarks, including gains of +4.2 percentage points on MVBench, +3.9 on PerceptionTest, and +6.8 on the challenging EgoSchema dataset over baseline models. Our model-agnostic approach is computationally efficient, requiring only 32 frames, and offers a promising direction for self-aligned video understanding without reliance on external models or annotations.
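The optimization step described in the abstract is the standard DPO objective applied to the model's self-generated preference pairs. Below is a minimal sketch of that objective, assuming a PyTorch setting; the function name `dpo_loss`, the `beta` default, and the tensor layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.

    Each argument holds summed token log-probabilities of the chosen
    (preferred) or rejected response under the policy being trained or
    under the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the preferred response above that of
    # the dispreferred one; the log-sigmoid gives a smooth margin loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example: a batch of two preference pairs (hypothetical log-prob values).
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-15.0, -9.7]),
                torch.tensor([-12.0, -8.4]), torch.tensor([-14.2, -9.5]))
```

In VideoSAVi's setting, the "chosen" response would come from the self-critique's improved alternative and the "rejected" one from the model's flawed initial answer, with the loop repeated across DPO iterations.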
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1273