Keywords: Video Understanding, Vision-Language Model, Preference Learning, Post-Training
Abstract: Despite recent advances in video large multimodal models (video-LMMs), accurate temporal grounding remains a key challenge. In this work, we introduce Temporal Preference Optimization (TPO), a post-training framework that unlocks superior temporal reasoning in video-LMMs without requiring human annotations. TPO enables preference modeling by manipulating video inputs to generate contrastive responses, ensuring that preferred responses are more temporally grounded than dispreferred ones. Through preference learning, TPO enhances the model's capability for more comprehensive video understanding with better temporal reasoning. Extensive experiments on LongVideoBench, MLVU, and Video-MME demonstrate that TPO significantly improves temporal grounding across multiple video-LMMs. Notably, LLaVA-Video-TPO achieves state-of-the-art performance among 7B models on Video-MME, establishing TPO as a scalable and effective solution for advancing temporal understanding in video analysis.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10638