Snippet-Inter Difference Attention Network for Weakly-Supervised Temporal Action Localization

Wei Zhou, Kang Lin, Weipeng Hu, Chao Xie, Tao Su, Haifeng Hu, Yap-Peng Tan

Published: 01 Jan 2025, Last Modified: 22 Feb 2026IEEE Transactions on MultimediaEveryoneRevisionsCC BY-SA 4.0

Abstract: The purpose of weakly-supervised temporal action localization (WTAL) task is to simultaneously classify and localize action instances in untrimmed videos with only video-level labels. Previous works fail to extract multi-scale temporal features to identify action instances with different durations, and they do not fully use the temporal cues of action video to learn discriminative features. In addition, the classifiers trained by current methods usually focus on easy-to-distinguish snippets while ignoring other semantically ambiguous features, which leads to incomplete and over-complete localization. To address these issues, we introduce a new Snippet-inter Difference Attention Network (SDANet) for WTAL, which can be trained end-to-end. Specifically, our model presents three modules, with primary contributions lying in the snippet-inter difference attention (SDA) module and potential feature mining (PFM) module. Firstly, we construct a simple multi-scale temporal feature fusion (MTFF) module to generate multi-scale temporal feature representation, so as to help the model better detect short action instances. Secondly, we consider the temporal cues of video features and design SDA module based on the Transformer to capture global discriminative features for each modality based on multi-scale features. It calculates the differences between temporal neighbor snippets in each modality to explore salient-difference features, and then utilizes them to guide correlation modeling. Thirdly, after learning discriminative features, we devise PFM module to excavate potential action and background snippets from ambiguous features. By contrastive learning, potential actions are forced closer to discriminative actions and away from the background, thereby learning more accurate action boundaries. Finally, two losses (i.e., similarity loss and reconstruction loss) are further developed to constrain the consistency between two modalities and help the model retain original feature information for better localization results. Extensive experiments show that our model achieves better performance against current WTAL methods on three datasets, i.e., THUMOS14, ActivityNet1.2 and ActivityNet1.3.

External IDs:doi:10.1109/tmm.2025.3535336