Abstract: Recent studies on similarity-based self-supervised representation learning tend to consider only a fixed temporal coverage of a given video. However, this limits a model's ability to learn temporally persistent representations, since it cannot capture the spatial and temporal information gaps that arise from resolution variations. To overcome this limitation, this paper proposes a Temporal Adaptive Teacher-Student (TATS) framework that encourages the trained model to be robust to spatio-temporal variations. Our key idea is to optimize similarity-based learning over several views with dynamic temporal resolutions. From a given video, TATS captures spatio-temporally invariant cues for temporally persistent representations via cross-resolution correspondence between local and global views. Extensive experiments show that TATS achieves competitive performance on downstream tasks (action recognition and video retrieval) on standard benchmarks (UCF101 and HMDB51).
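To make the described scheme concrete, here is a minimal sketch (not the authors' implementation) of how cross-resolution similarity learning between local and global views might look in PyTorch. The clip-sampling parameters, the `student`/`teacher` encoders, and the BYOL-style cosine loss are all assumptions for illustration.

```python
# Hypothetical sketch of cross-resolution teacher-student learning,
# loosely following the abstract; NOT the authors' code.
import torch
import torch.nn.functional as F

def sample_clip(video, num_frames, stride):
    """Sample `num_frames` frames at temporal `stride` from a video tensor (T, C, H, W)."""
    T = video.shape[0]
    start = torch.randint(0, max(T - num_frames * stride, 1), (1,)).item()
    idx = torch.arange(start, start + num_frames * stride, stride).clamp(max=T - 1)
    return video[idx]

def cross_resolution_loss(student, teacher, video):
    # Global view: wide temporal coverage at a coarse temporal resolution (assumed stride).
    g = sample_clip(video, num_frames=8, stride=8)
    # Local views: narrower coverage at varying (dynamic) temporal resolutions.
    local_views = [sample_clip(video, num_frames=8, stride=s) for s in (1, 2, 4)]

    with torch.no_grad():
        # Teacher target with stop-gradient; encoders are assumed to map
        # a (B, T, C, H, W) clip to a (B, D) embedding.
        target = F.normalize(teacher(g.unsqueeze(0)), dim=-1)

    # Cross-resolution correspondence: each local view is pulled toward
    # the global target via a BYOL-style negative-cosine loss.
    loss = 0.0
    for v in local_views:
        pred = F.normalize(student(v.unsqueeze(0)), dim=-1)
        loss = loss + (2 - 2 * (pred * target).sum(dim=-1)).mean()
    return loss / len(local_views)
```

In teacher-student frameworks of this kind, the teacher weights are typically maintained as an exponential moving average of the student's; that update is omitted here for brevity.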