Abstract: With the development of deep neural networks, video action recognition has gradually become a research hotspot in recent years. However, the additional temporal dimension in video makes this task very challenging. In this paper, we propose a novel Spatio-Temporal Self-Supervision enhanced Transformer Networks (STTNet) for video action recognition, which mainly consists of a Self-Supervised Spatio-Temporal Representation Learning module and a Transformer-based Spatio-Temporal Aggregator module. Concretely, our STTNet adaptively encodes the spatially and temporally enhanced key features, which are learned by the Temporal and Spatial Self-Supervised sub-modules from unlabeled video data, in a nonlinear and non-local manner via the Transformer-based Spatio-Temporal Aggregator. Extensive experiments on three widely used datasets (HMDB51, UCF101 and Something-Something V1) demonstrate that our proposed STTNet achieves state-of-the-art performance. Code is available at https://github.com/ICME2022/STTNet.
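The abstract's "nonlinear and non-local" aggregation refers to attention-style mixing of spatial and temporal feature tokens. The following is a minimal NumPy sketch of that idea only, not the authors' implementation: spatial and temporal token sets are concatenated and mixed with single-head scaled dot-product self-attention, so every output token can attend to every input token regardless of position. All names (`aggregate`, token counts, feature dimension) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(spatial, temporal):
    """Toy non-local aggregation of spatial and temporal feature tokens
    via single-head scaled dot-product self-attention."""
    tokens = np.concatenate([spatial, temporal], axis=0)   # (S + T, d)
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)  # (S+T, S+T), rows sum to 1
    return attn @ tokens                                   # (S + T, d)

rng = np.random.default_rng(0)
spatial = rng.standard_normal((4, 8))    # 4 spatial tokens, feature dim 8
temporal = rng.standard_normal((6, 8))   # 6 temporal tokens
out = aggregate(spatial, temporal)
print(out.shape)
```

A real Transformer aggregator would add learned query/key/value projections, multiple heads, and feed-forward layers on top of this attention core; the sketch only shows why the mixing is non-local (every token's output is a weighted sum over all tokens).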