Abstract: Weakly supervised temporal action localization (WTAL) aims to localize action instances using only video-level labels for supervision. Recent methods convert category labels into natural language through prompting and use pre-trained vision-language models to generate text representations that serve as supervision, because natural language provides richer and more generalizable semantic supervision that compensates for the scarce supervision available in weakly supervised settings. However, current prompting methods cannot generate dynamic prompts that adapt to each video, which makes it difficult to align text and video representations accurately. In this work, we propose a novel Text-Video Knowledge Guided Prompting (TVKP) framework for WTAL, which generates video-aware prompts based on text-video knowledge to enhance semantic alignment between text and video representations, and introduces additional video-related external category labels to enrich semantic supervision. We introduce a video-aware prompting (VAP) module that learns text-video knowledge from the joint distribution of text and video representations to generate video-aware text representations. To help VAP learn text-video knowledge more effectively, we propose a text-video contrastive loss that enforces semantic consistency between text and video representations. In addition, we propose an external knowledge prompting (EKP) module that draws additional video-related text labels from an external knowledge base to enrich the prompts for accurate semantic alignment. Extensive experiments on three public datasets, THUMOS14, ActivityNet1.2, and ActivityNet1.3, demonstrate that our approach outperforms state-of-the-art methods.
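The abstract does not give the exact form of the text-video contrastive loss. As a rough illustration only, the sketch below assumes a common symmetric InfoNCE-style formulation in which the i-th video representation is paired with the i-th text representation. The function name `text_video_contrastive_loss`, the temperature value, and the one-to-one pairing are assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def text_video_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form, not the paper's definition).

    video_feats[i] and text_feats[i] are treated as a matching pair;
    every other combination in the batch is treated as a negative.
    """
    assert len(video_feats) == len(text_feats), "need one text per video"
    v = [l2_normalize(x) for x in video_feats]
    t = [l2_normalize(x) for x in text_feats]
    n = len(v)

    # Temperature-scaled cosine similarity between every video and every text.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (v2t + t2v)


# Toy usage: aligned pairs should score a lower loss than shuffled pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = text_video_contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = text_video_contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

Under this assumed formulation, minimizing the loss pulls each video representation toward its paired text representation and pushes it away from the others in the batch, which is one way to encourage the semantic consistency the abstract describes.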
External IDs:dblp:journals/tcsv/ShaoZX25