Revisiting Unsupervised Temporal Action Localization: The Primacy of High-Quality Actionness and Pseudolabels

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Temporal action localization (TAL), especially in its weakly-supervised and unsupervised forms, has recently become a hot research topic. Existing unsupervised methods follow an iterative "clustering and training" strategy with diverse model designs for the training stage, but they often overlook maintaining consistency between the two stages, which is crucial: more accurate clustering reduces pseudolabel noise and thus enhances model training, while more robust training in turn enriches the feature representation used for clustering. We identify two critical challenges in the unsupervised scenario: 1. What features should the model generate for clustering? 2. Which pseudolabeled instances from clustering should be chosen for model training? After extensive exploration, we propose a novel yet simple framework, Consistency-Oriented Progressive high-actionness Learning, to address these issues. For feature generation, the framework adopts a High Actionness snippet Selection (HAS) module that produces more discriminative global video features for clustering from the enhanced actionness features of a designed Inner-Outer Consistency Network (IOCNet). For pseudolabel selection, we introduce a Progressive Learning With Representative Instances (PLRI) strategy that identifies the most reliable and informative instances within each cluster for model training. These three modules, HAS, IOCNet, and PLRI, synergistically improve consistency between model training and clustering. Extensive experiments on the THUMOS'14 and ActivityNet v1.2 datasets, under both unsupervised and weakly-supervised settings, demonstrate that our framework achieves state-of-the-art results.
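The iterative "clustering and training" loop described above can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the helper names (`select_high_actionness`, `pick_representatives`), the plain k-means clustering, and all hyperparameters are assumptions standing in for the HAS module, the clustering step, and the PLRI selection strategy respectively.

```python
# Illustrative sketch (not the authors' code) of one round of the
# "clustering and training" pipeline: pool high-actionness snippets into
# video features, cluster them into pseudolabels, then keep only the
# instances closest to their cluster center as reliable training samples.
import numpy as np

def select_high_actionness(snippet_feats, actionness, top_k):
    """Pool only the top-k most action-like snippets into one video feature
    (a stand-in for the HAS module)."""
    idx = np.argsort(actionness)[-top_k:]
    return snippet_feats[idx].mean(axis=0)

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means used here purely for illustration."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

def pick_representatives(X, labels, centers, keep_ratio=0.5):
    """Within each cluster, keep the fraction of instances nearest the center
    as reliable pseudolabeled samples (a stand-in for PLRI)."""
    dist = np.linalg.norm(X - centers[labels], axis=1)
    kept = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        order = members[np.argsort(dist[members])]
        kept.extend(order[: max(1, int(len(order) * keep_ratio))].tolist())
    return sorted(kept)

# Toy data: 6 videos, each with 8 snippet features of dimension 4.
rng = np.random.default_rng(1)
videos = rng.normal(size=(6, 8, 4))
actionness = rng.random(size=(6, 8))
feats = np.stack([select_high_actionness(v, a, top_k=3)
                  for v, a in zip(videos, actionness)])
labels, centers = kmeans(feats, k=2)
reliable = pick_representatives(feats, labels, centers, keep_ratio=0.5)
# `reliable` indexes the videos whose pseudolabels would feed model training;
# retraining would then refresh `feats` for the next clustering round.
```

In the actual framework the features come from a trained localization model rather than random data, and selection grows progressively across iterations; this sketch only shows the control flow of one round.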
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation, [Content] Media Interpretation
Relevance To Conference: The submitted paper focuses on unsupervised temporal action localization (UTAL) in videos, a critical task in multimedia content understanding. The proposed Consistency-Oriented Progressive Learning (COPL) framework addresses the challenges of UTAL by extracting discriminative features from unlabeled videos and fusing two-modality features: RGB and motion (optical flow). The relevance to the conference is multi-faceted: 1. **Multimedia Search and Recommendation**: The generated video-level pseudolabels can enhance the performance of multimedia search and recommendation systems. 2. **Multimedia Interpretation**: The work presents a novel method for interpreting and understanding video content, aligning with this topic area. 3. **Vision and Language**: Localized action instances can be combined with natural language descriptions, bridging vision and language. In summary, the proposed UTAL method contributes to multiple topics under the themes of Multimedia Content Understanding, Multimedia Systems, and Experience. It represents an important advancement in video content analysis, demonstrating the power of fusing multi-modal information and aligning with key research directions of the conference.
Supplementary Material: zip
Submission Number: 2983