Abstract: Recent studies on similarity-based self-supervised representation learning tend to consider only a fixed temporal coverage of a given video. However, this limits a model's ability to learn temporally persistent representations, since it cannot capture the spatial and temporal information gaps that arise from resolution variations. To overcome this limitation, this paper proposes a Temporal Adaptive Teacher-Student (TATS) framework that encourages the trained model to be robust to spatio-temporal variations. Our key idea is to optimize similarity-based learning over several views with dynamic temporal resolutions. From a given video, TATS captures spatio-temporally invariant cues for temporally persistent representations via cross-resolution correspondence between local and global views. Extensive experiments show that TATS achieves competitive performance on downstream tasks (action recognition and video retrieval) on standard benchmarks (UCF101 and HMDB51).
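To make the described scheme concrete, here is a minimal sketch (not the authors' implementation) of how cross-resolution similarity learning between local and global views might look in PyTorch. The clip-sampling parameters, the `student`/`teacher` encoders, and the BYOL-style cosine loss are all assumptions for illustration.

```python
# Hypothetical sketch of cross-resolution teacher-student learning,
# loosely following the abstract; NOT the authors' code.
import torch
import torch.nn.functional as F

def sample_clip(video, num_frames, stride):
    """Sample `num_frames` frames at temporal `stride` from a video tensor (T, C, H, W)."""
    T = video.shape[0]
    start = torch.randint(0, max(T - num_frames * stride, 1), (1,)).item()
    idx = torch.arange(start, start + num_frames * stride, stride).clamp(max=T - 1)
    return video[idx]

def cross_resolution_loss(student, teacher, video):
    # Global view: wide temporal coverage at a coarse temporal resolution (assumed stride).
    g = sample_clip(video, num_frames=8, stride=8)
    # Local views: narrower coverage at varying (dynamic) temporal resolutions.
    local_views = [sample_clip(video, num_frames=8, stride=s) for s in (1, 2, 4)]

    with torch.no_grad():
        # Teacher target with stop-gradient; encoders are assumed to map
        # a (B, T, C, H, W) clip to a (B, D) embedding.
        target = F.normalize(teacher(g.unsqueeze(0)), dim=-1)

    # Cross-resolution correspondence: each local view is pulled toward
    # the global target via a BYOL-style negative-cosine loss.
    loss = 0.0
    for v in local_views:
        pred = F.normalize(student(v.unsqueeze(0)), dim=-1)
        loss = loss + (2 - 2 * (pred * target).sum(dim=-1)).mean()
    return loss / len(local_views)
```

In teacher-student frameworks of this kind, the teacher weights are typically maintained as an exponential moving average of the student's; that update is omitted here for brevity.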