Weakly supervised temporal action localization: a survey

Published: 01 Jan 2024, Last Modified: 12 Apr 2025. Multim. Tools Appl. 2024. CC BY-SA 4.0
Abstract: Temporal action localization (TAL) is one of the most important tasks in video understanding. Weakly supervised temporal action localization (WTAL) involves classifying and localizing all action instances in untrimmed videos using only video-level category labels as supervision, which is challenging because frame-level annotations are absent. In this study, we first review the development of the WTAL task in recent years and summarize and analyze its main problems. Second, we classify and compare the research approaches of existing models and thoroughly discuss methods based on multiple instance learning (MIL), feature erasing, the attention mechanism, similarity propagation, pseudo-ground-truth generation, contrastive learning, and adversarial learning. Then, we present the datasets and evaluation criteria for the WTAL task. Finally, we discuss the main application areas and further developments in WTAL.