Abstract: Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Because many existing works optimize WTAL models with action classification labels alone, they suffer from a task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to exploit action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, existing research still falls short in two respects. Previous approaches focus primarily on leveraging textual information from language models but overlook the alignment of dynamic human action with VLP knowledge in a joint embedding space. Furthermore, the deterministic representations employed in previous studies struggle to capture fine-grained human motion. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies show that our method significantly outperforms all previous state-of-the-art methods. Our code will be released after publication.
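To make the idea of probabilistic alignment and distribution-level contrastive learning more concrete, below is a minimal PyTorch-style sketch. It assumes diagonal-Gaussian embeddings for the video and text streams and uses a closed-form 2-Wasserstein distance as the statistical similarity inside an InfoNCE-style loss; the module names, the choice of distance, and the loss form are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProbabilisticEmbedder(nn.Module):
    """Maps input features to a diagonal Gaussian (mean, log-variance) in a shared space.

    Hypothetical sketch: the paper's exact parameterization is not specified here.
    """

    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.mu_head = nn.Linear(in_dim, embed_dim)      # distribution mean
        self.logvar_head = nn.Linear(in_dim, embed_dim)  # log-variance (diagonal covariance)

    def forward(self, x: torch.Tensor):
        return self.mu_head(x), self.logvar_head(x)


def wasserstein2_sq(mu1, logvar1, mu2, logvar2):
    """Squared 2-Wasserstein distance between diagonal Gaussians (one possible statistical similarity)."""
    sigma1, sigma2 = torch.exp(0.5 * logvar1), torch.exp(0.5 * logvar2)
    return ((mu1 - mu2) ** 2).sum(-1) + ((sigma1 - sigma2) ** 2).sum(-1)


def distribution_contrastive_loss(mu_v, logvar_v, mu_t, logvar_t, temperature: float = 0.1):
    """InfoNCE-style loss over pairwise distribution distances.

    Matched video/text distributions (row i, column i) are positives; other columns are negatives.
    """
    dist = wasserstein2_sq(
        mu_v.unsqueeze(1), logvar_v.unsqueeze(1),  # (B, 1, D)
        mu_t.unsqueeze(0), logvar_t.unsqueeze(0),  # (1, B, D)
    )                                              # -> (B, B) pairwise distances
    logits = -dist / temperature                   # smaller distance -> higher similarity
    targets = torch.arange(mu_v.size(0), device=mu_v.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Illustrative dimensions only: batch of 8 snippets, video features of size 2048,
    # text features of size 512 (e.g., from a VLP text encoder), shared space of size 512.
    B, vid_dim, txt_dim, d = 8, 2048, 512, 512
    video_embedder = ProbabilisticEmbedder(vid_dim, d)
    text_embedder = ProbabilisticEmbedder(txt_dim, d)
    mu_v, lv_v = video_embedder(torch.randn(B, vid_dim))
    mu_t, lv_t = text_embedder(torch.randn(B, txt_dim))
    print(distribution_contrastive_loss(mu_v, lv_v, mu_t, lv_t).item())
```

In this sketch the intra- vs. inter-distribution distinction would correspond to applying the same loss within one modality's distributions and across the video-text pairs, respectively; only the cross-modal case is shown above.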
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Vision and Language, [Content] Media Interpretation, [Experience] Multimedia Applications
Relevance To Conference: This paper presents a study on weakly supervised temporal action localization (WTAL), which can play a crucial role in multimedia platforms. Temporal action localization is the problem of precisely determining the time intervals at which human activities occur in long, untrimmed videos, and it is of fundamental importance in video understanding. In particular, we identify several challenges that arise when vision-language pre-training is employed to address the scarcity of annotations in weakly supervised settings, and we propose effective solutions to overcome them. We propose an effective multimodal framework that leverages both the vision and language modalities through probabilistic representations, and we demonstrate its superiority over previous state-of-the-art methods through a diverse set of experiments. Our work addresses key issues in multimedia platforms through a multimodal framework, and we anticipate that it will be highly relevant to this academic community.
Submission Number: 4545