Abstract: The objective of temporal action localization (TAL) is to predict predefined action labels and the corresponding temporal boundaries in a video. TAL is inherently a multi-modal modeling task and depends heavily on the quality of temporal context representation. Motivated by this property, we propose an end-to-end local-to-global modeling architecture that learns contextual consistency both along the temporal sequence and across modalities. Specifically, a local-to-global encoding Transformer models the video sequence to obtain video representations at different time scales. To strike a reasonable balance between the specificity and the correlation of different modalities, a cross semantic alignment (CSA) module is proposed to re-weight the encoded multi-modal features according to whether they attend to semantic correlations or to modality-specific cues. Furthermore, to learn cross-modal consistency from local to global scales and uni-modal consistency within the same category, a self-consistency learning (SCL) objective is designed to train the network. Experimental results demonstrate that our method yields substantial improvements over prior work: our model achieves 68.3% and 37.1% average mAP on THUMOS14 and ActivityNet 1.3, respectively, outperforming state-of-the-art multi-stage and one-stage models.
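As a rough illustration only (the abstract does not specify the implementation), a CSA-style cross-modal re-weighting could be sketched as below; the module name `CrossSemanticAlignment` and the RGB/optical-flow inputs are assumptions, not the authors' code.

```python
# Hypothetical sketch of cross-modal re-weighting in the spirit of CSA.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn

class CrossSemanticAlignment(nn.Module):
    """Re-weight features of two modalities by cross-modal semantic similarity."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        # a, b: (batch, time, dim) encoded features from two modalities.
        # Cross-attention weights capture semantic correlation across modalities.
        attn = torch.softmax(
            self.q(a) @ self.k(b).transpose(1, 2) / a.size(-1) ** 0.5, dim=-1
        )
        # Residual re-weighting: each modality absorbs context aggregated from
        # the other, while the skip connection preserves modality-specific cues.
        a_out = a + attn @ b
        b_out = b + attn.transpose(1, 2) @ a
        return a_out, b_out

# Usage with dummy RGB and optical-flow clip features.
rgb, flow = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
rgb_aligned, flow_aligned = CrossSemanticAlignment(256)(rgb, flow)
```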