$\mathcal{V}ista\mathcal{DPO}$: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-ND 4.0
Abstract: Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce **VistaDPO**, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) **Instance Level**, aligning overall video content with responses; ii) **Temporal Level**, aligning video temporal semantics with event descriptions; and iii) **Perceptive Level**, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct **VistaDPO-7k**, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on video hallucination, video QA, and captioning benchmarks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination.
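For readers unfamiliar with DPO, the sketch below illustrates the standard instance-level DPO objective that VistaDPO builds on: given a chosen and a rejected response, the loss pushes the policy to prefer the chosen one while staying close to a frozen reference model. This is a minimal sketch of the generic DPO loss only; the paper's temporal- and perceptive-level terms are not reproduced here, and the function name, argument names, and `beta` value are illustrative assumptions rather than the authors' code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard (instance-level) DPO objective.

    Each argument is a tensor of summed token log-probabilities for a
    batch of responses, under the trainable policy or the frozen
    reference model.
    """
    # Implicit rewards: log-ratio of policy to reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference: maximize the sigmoid of the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

VistaDPO's contribution, per the abstract, is applying this kind of preference objective hierarchically, so that preferences are expressed not only over whole responses but also over temporal segments and spatial objects grounded in the video.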
Lay Summary:

## What is the problem being addressed?

Today's AI models can watch and "understand" videos, but they often make mistakes that humans would not, such as describing things that never happened in the video, or misunderstanding what's important. These mistakes are called **hallucinations** and **misalignments**. Fixing this is important for making AI more trustworthy and useful in real-world video applications, like video search, education, or security.

## What is the key idea or solution?

The authors introduce **VistaDPO**, a new method that teaches AI to better match human preferences when interpreting videos. Unlike previous approaches that only look at the overall video or text, VistaDPO breaks down the problem into three levels:

- **Instance Level**: Does the AI's answer match the whole video's main idea?
- **Temporal Level**: Does the AI understand the order of events and what happens when?
- **Perceptive Level**: Does the AI correctly recognize objects and actions at each moment?

To train and test their method, the authors also built a new dataset called **VistaDPO-7k**, which contains thousands of video questions with detailed human feedback, including which answers are correct, which are wrong, and where/when important things happen in the video.

## Why is this important?

By teaching AI to align its understanding with human preferences at multiple levels, VistaDPO helps reduce hallucinations and improves the accuracy of video-based AI systems. This makes AI more reliable for tasks like answering questions about videos, generating captions, or summarizing video content.

## What are the main results and evidence?

VistaDPO was tested on several standard video AI tasks, such as detecting hallucinations, answering questions, and generating captions. Compared to previous leading methods, VistaDPO significantly reduced errors and hallucinations. For example, it improved performance by over 26% on some benchmarks compared to the best prior models. The authors also showed that VistaDPO is more robust to tricky or adversarial scenarios, where other models might be fooled.

## Who might benefit from this research?

- **General public**: More reliable video-based AI assistants, better video search and summarization.
- **Researchers and developers**: New methods and data for building better video-language models.
- **Industries**: Education, entertainment, surveillance, accessibility, and anywhere video understanding is important.

## Are there any broader impacts or ethical considerations?

The authors highlight that while their work can make AI more robust and trustworthy, it should be used responsibly, especially in sensitive areas like surveillance or automated decision-making. They took care to reduce bias and hallucinations in their dataset and encourage responsible use and further evaluation for fairness.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/HaroldChen19/VistaDPO
Primary Area: Theory->Optimization
Keywords: Direct Preference Optimization, Large Video Models
Submission Number: 2623