Full-Stage Pseudo Label Quality Enhancement for Weakly-supervised Temporal Action Localization

Published: 03 Jun 2025, Last Modified: 26 Jul 2025 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0
Abstract: Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. Recent methods introduce a pseudo label learning framework to bridge the gap between classification-based training and the localization target at inference. Typically, this framework employs a classification-based teacher model to generate pseudo labels, which are then used to train a regression-based student model for precise boundary prediction. However, the quality of these pseudo labels, which is critical to the student model’s performance, has not been systematically investigated, leading to suboptimal localization accuracy. In this paper, we propose a set of simple yet efficient mechanisms for pseudo label quality enhancement to build our FuSTAL framework. Unlike previous one- or two-stage methods, FuSTAL decomposes the learning process into three stages and enhances pseudo label quality at each one: cross-video contrastive learning for more informative initial pseudo labels at the Generation-Stage, prior-based filtering to remove false positive proposals at the Selection-Stage, and EMA-based distillation for smoother pseudo labels at the Training-Stage. These designs complement each other and improve the quality of action proposals with respect to accuracy, true positive rate, and smoothness. With these designs at all three stages, FuSTAL achieves an average mAP of 50.8% on the THUMOS’14 benchmark, outperforming the previous best method by 1.2%.
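As a rough illustration of the Training-Stage idea, the sketch below shows a generic exponential moving average (EMA) weight update, the standard building block behind EMA-based distillation. It is not FuSTAL's implementation: the abstract does not specify which model is averaged, the momentum value, or the update schedule, so the function name `ema_update`, the dictionary-of-lists parameter layout, and the default `momentum=0.999` are all assumptions made for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical layout: parameter name -> flat list of weights


def ema_update(teacher: Params, student: Params, momentum: float = 0.999) -> None:
    """Update teacher weights in place as an EMA of the student weights.

    teacher <- momentum * teacher + (1 - momentum) * student

    The momentum default is an assumed value, not one taken from the paper.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    for name, student_values in student.items():
        teacher_values = teacher[name]
        for i, s in enumerate(student_values):
            # A high momentum makes the teacher change slowly, which is what
            # smooths the pseudo labels it produces across training steps.
            teacher_values[i] = momentum * teacher_values[i] + (1.0 - momentum) * s


# Usage: one update step with toy weights and a deliberately low momentum.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the averaged model lags behind the student and filters out step-to-step noise, which is the usual motivation for generating targets from an EMA model rather than from the raw student.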