Abstract: This paper studies the problem of learning self-supervised representations on videos. In contrast to the image modality, which only requires appearance information about objects or scenes, video further requires exploring the relations between multiple frames/clips along the temporal dimension. However, recently proposed contrastive self-supervised frameworks do not grasp such relations explicitly, since they simply utilize two augmented clips from the same video and compare their distance without referring to their temporal relation. To address this, we present a contrast-and-order representation (CORP) framework for learning self-supervised video representations that automatically captures both the appearance information within each frame and the temporal information across different frames. In particular, given two video clips, our model first predicts whether they come from the same input video, and then predicts the temporal ordering of the clips if they do. We also propose a novel decoupling attention method to learn symmetric similarity (contrast) and anti-symmetric patterns (order). Such a design involves neither extra parameters nor extra computation, yet speeds up the learning process and improves accuracy compared to vanilla multi-head attention. We extensively validate the representation ability of our learned video features on the downstream action recognition task on Kinetics-400 and Something-Something V2. Our method outperforms previous state-of-the-art methods by a significant margin.
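To make the two-stage prediction concrete, below is a minimal PyTorch sketch (not the authors' released code) of the contrast-and-order objective as described in the abstract. The module name `ContrastAndOrderHead`, the embedding dimension, and the two linear heads are hypothetical; it assumes clip embeddings from some video backbone and shows how a "same-video" head and a temporal-order head could be attached to clip pairs.

```python
# A minimal sketch of the contrast-and-order idea, assuming hypothetical
# names and shapes. Two clip embeddings z1, z2 feed two heads: one
# predicts whether the clips share a source video (contrast), the other
# predicts their temporal order when they do (order).
import torch
import torch.nn as nn

class ContrastAndOrderHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Binary head: do the two clips come from the same video?
        self.same_video = nn.Linear(2 * dim, 2)
        # Binary head: does clip 1 precede clip 2 in time?
        self.order = nn.Linear(2 * dim, 2)

    def forward(self, z1: torch.Tensor, z2: torch.Tensor):
        # Concatenate the pair of clip embeddings along the feature dim.
        pair = torch.cat([z1, z2], dim=-1)
        return self.same_video(pair), self.order(pair)

# Usage with dummy embeddings (hypothetical backbone output, dim=512).
head = ContrastAndOrderHead(dim=512)
z1, z2 = torch.randn(8, 512), torch.randn(8, 512)
same_logits, order_logits = head(z1, z2)
# In training, the order loss would apply only to positive (same-video)
# pairs, so the model learns appearance cues from the contrast task and
# temporal cues from the order task.
```

This is only an illustration of the training signal; the paper's decoupling attention, which separates symmetric (contrast) from anti-symmetric (order) patterns inside the attention computation, is not shown here.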