Keywords: Temporal Action Localization, Weakly Supervised Learning, Vision-Language Model
Abstract: Weakly supervised temporal action localization (WS-TAL) aims to localize actions in untrimmed videos using only video-level labels. Due to the absence of frame-level annotations, classification predictions during the initial training phase predominantly rely on the prior knowledge embedded in pre-trained video foundation models.
However, the erroneous biases inherent in the foundation model go uncorrected during training, so these errors compound and propagate throughout the learning process.
To address this issue, we develop a dual-branch framework called Vision-Language Preference Optimization (VLPO) that enhances WS-TAL through systematic integration with a vision-language model (VLM).
Our framework introduces two key components:
(1) The Vision-Language Fine-Tuning (VLFT) branch establishes a multimodal feature alignment mechanism through video-level supervision and adaptively fine-tunes the vision-language features online, significantly enhancing the semantic sensitivity of temporal localization under weakly supervised conditions;
(2) The Preference Driven Optimization (PDO) branch uses the predictive preferences provided by the VLM to refine the traditional WS-TAL framework and actionness learning at the snippet level, from both class-aware and class-agnostic perspectives, significantly improving the accuracy of action localization.
Extensive experiments on WS-TAL benchmarks demonstrate that VLPO significantly outperforms state-of-the-art methods.
The source code will be released upon acceptance.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6589