A Snippets Relation and Hard-Snippets Mask Network for Weakly-Supervised Temporal Action Localization

Yibo Zhao; Hua Zhang; Zan Gao; Weili Guan; Meng Wang; Shengyong Chen

Published: 01 Jan 2024, Last Modified: 13 May 2025. IEEE Trans. Circuits Syst. Video Technol. 2024. License: CC BY-SA 4.0

Abstract: Weakly-supervised temporal action localization (WTAL) is the problem of learning an action localization model when only video-level labels are available. Many WTAL methods have been developed in recent years. However, existing approaches often do not consider the hard-to-predict snippets near action boundaries, which leads to incomplete and over-complete action predictions. To address these issues, this work proposes an end-to-end snippets relation and hard-snippets mask network (SRHN). Specifically, a hard-snippets mask module adaptively masks the hard-to-predict snippets so that the trained model focuses more on snippets with low uncertainty. Then, a snippets relation module captures the relationships among snippets and makes hard-to-predict snippets easier to predict by aggregating information from multiple temporal receptive fields. Finally, a snippet enhancement loss is developed that, for both hard-to-predict snippets and other snippets, reduces the predicted probabilities of actions that are not present in the video and enlarges the probabilities of actions that are present. Extensive experiments on the THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of the SRHN method.
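
The idea of masking hard-to-predict snippets can be illustrated with a small sketch. This is not the SRHN implementation: the abstract does not say how uncertainty is measured or how the mask adapts, so the sketch assumes Shannon entropy of each snippet's class distribution as the uncertainty score and a fixed masking ratio. The names `snippet_entropy`, `hard_snippet_mask`, `masked_video_score`, and `mask_ratio` are hypothetical.

```python
import math


def snippet_entropy(probs):
    """Shannon entropy (natural log) of one snippet's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def hard_snippet_mask(snippet_probs, mask_ratio=0.25):
    """Return one 0/1 flag per snippet; 0 marks the most uncertain snippets.

    Assumption: uncertainty is entropy and a fixed fraction is masked.
    The paper describes an adaptive mask, which is not reproduced here.
    """
    n = len(snippet_probs)
    # Always keep at least one snippet so later averages are well defined.
    num_masked = min(int(n * mask_ratio), n - 1)
    entropies = [snippet_entropy(p) for p in snippet_probs]
    # Snippet indices ordered from most to least uncertain.
    order = sorted(range(n), key=lambda i: entropies[i], reverse=True)
    hard = set(order[:num_masked])
    return [0 if i in hard else 1 for i in range(n)]


def masked_video_score(snippet_probs, mask):
    """Average class probabilities over the snippets that were kept."""
    kept = [p for p, m in zip(snippet_probs, mask) if m == 1]
    num_classes = len(snippet_probs[0])
    return [sum(p[c] for p in kept) / len(kept) for c in range(num_classes)]


# Toy per-snippet distributions over (action, background).
probs = [
    [0.95, 0.05],  # confident action
    [0.55, 0.45],  # near a boundary, uncertain
    [0.10, 0.90],  # confident background
    [0.50, 0.50],  # near a boundary, most uncertain
]
mask = hard_snippet_mask(probs, mask_ratio=0.5)
score = masked_video_score(probs, mask)
```

In this toy input the two boundary-like snippets have the highest entropy, so they are excluded and the video-level score is computed only from the two confident snippets.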
