Cross-Task Relation-Aware Consistency for Weakly Supervised Temporal Action Detection

Wenfei Yang, Huan Ren, Tianzhu Zhang, Zhe Zhang, Yongdong Zhang, Feng Wu

Published: 2025, Last Modified: 15 Jan 2026. IEEE Trans. Pattern Anal. Mach. Intell. 2025. License: CC BY-SA 4.0

Abstract: Temporal action detection aims to predict the temporal boundaries and category labels of actions in untrimmed videos. In recent years, many weakly supervised temporal action detection methods have been proposed to reduce the annotation cost of fully supervised methods. Because of the discrepancy between action localization and action classification, existing weakly supervised methods widely adopt a two-branch structure, in which the classification branch predicts category-wise scores and the localization branch predicts a foreground score for each segment. Under the weakly supervised setting, model training is guided mainly by video-level or sparse segment-level annotations. As a result, the classification branch tends to focus on the most discriminative segments while ignoring less discriminative ones in order to minimize the classification cost, and the localization branch may assign high foreground scores to some negative segments. This phenomenon can severely damage action detection performance, because the foreground scores and classification scores are combined at test time for action detection. To deal with this problem, several methods have been proposed to encourage consistency between the classification branch and the localization branch. However, these methods consider only video-level or segment-level consistency and do not require the relations among different segments to be consistent across the two branches. In this paper, we propose a Cross-Task Relation-Aware Consistency (CRC) strategy for weakly supervised temporal action detection, which includes an intra-video consistency module and an inter-video consistency module. The intra-video consistency module enforces consistency of the relations among segments from the same video, and the inter-video consistency module enforces consistency of the relations among segments from different videos. The two modules are complementary, together covering both intra-video and inter-video consistency. Experimental results show that the proposed CRC strategy consistently improves the performance of existing weakly supervised methods, including click-level supervised methods (e.g., LACP; Lee et al., 2021), video-level supervised methods (e.g., DELU; Chen et al., 2022) and unsupervised methods (e.g., BaS-Net; Lee et al., 2020), verifying the generality and effectiveness of the proposed method.
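
The two-branch setup and the idea of relation-level consistency can be illustrated with a small NumPy sketch. It is not the CRC formulation from the paper: the abstract does not give the loss, so the relation measures (cosine similarity for class scores, one minus absolute difference for foreground scores), the squared-error penalty, and all function names below are assumptions chosen only for illustration.

```python
import numpy as np


def fuse_scores(cls_scores, fg_scores):
    """Combine per-segment class scores (T, C) with foreground scores (T,).

    Assumption: the two branches are fused by simple multiplication.
    """
    cls_scores = np.asarray(cls_scores, dtype=float)
    fg_scores = np.asarray(fg_scores, dtype=float)
    return cls_scores * fg_scores[:, None]


def cosine_relation(x):
    """Pairwise cosine similarity between the rows of x, shape (T, T)."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.maximum(norms, 1e-8)  # guard against all-zero rows
    return x @ x.T


def relation_consistency_loss(cls_scores, fg_scores):
    """Hypothetical penalty on disagreement between segment relations.

    rel_cls: how similar two segments look to the classification branch.
    rel_fg:  how similar two segments look to the localization branch.
    The loss is the mean squared difference of the two (T, T) matrices.
    """
    rel_cls = cosine_relation(cls_scores)
    fg = np.asarray(fg_scores, dtype=float)
    rel_fg = 1.0 - np.abs(fg[:, None] - fg[None, :])
    return float(np.mean((rel_cls - rel_fg) ** 2))


# Two segments that the localization branch treats alike (both 0.9)
# but the classification branch assigns to different classes.
cls = np.array([[1.0, 0.0], [0.0, 1.0]])
fg = np.array([0.9, 0.9])
loss = relation_consistency_loss(cls, fg)
```

Under these assumptions, passing the segments of a single video corresponds to the intra-video case, while concatenating the segments of several videos along the first axis before calling `relation_consistency_loss` would give a rough analogue of the inter-video case.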

External IDs: dblp:journals/pami/YangRZZZW25