Keywords: Temporal Action Localization, Weakly Supervised Learning, Vision-Language Model
Abstract: Weakly supervised temporal action localization (WS-TAL) aims to localize actions in untrimmed videos using only video-level labels. Due to the absence of frame-level annotations, classification predictions during the initial training phase predominantly rely on the prior knowledge embedded in pre-trained video foundation models.
However, the erroneous biases inherent in the foundation model go uncorrected during training, so these errors compound and propagate throughout the learning process.
To address this issue, we develop a dual-branch framework called Vision-Language Preference Optimization (VLPO) that enhances WS-TAL through systematic integration with a vision-language model (VLM).
Our framework introduces two key components:
(1) The Vision-Language Fine-Tuning (VLFT) branch establishes a multimodal feature alignment mechanism through video-level supervision and adaptively fine-tunes the vision-language features online, significantly enhancing the semantic sensitivity of temporal localization under weakly supervised conditions;
(2) The Preference Driven Optimization (PDO) branch uses the predictive preferences provided by the VLM to refine the traditional WS-TAL framework and actionness learning at the snippet level, from both class-aware and class-agnostic perspectives, significantly improving the accuracy of action localization.
Extensive experiments on WS-TAL benchmarks demonstrate that VLPO significantly outperforms state-of-the-art methods.
The source code will be released upon acceptance.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6589