MaMiCo: Macro-to-Micro Semantic Correspondence for Self-supervised Video Representation Learning

ACM Multimedia 2022 (modified: 26 Nov 2022)
Abstract: Contrastive self-supervised learning (CSL) has remarkably advanced visual representation learning. However, existing video CSL methods focus mainly on clip-level temporal semantic consistency; the temporal and spatial semantic correspondence across different granularities, i.e., the video, clip, and frame levels, is typically overlooked. To tackle this issue, we propose a self-supervised Macro-to-Micro Semantic Correspondence (MaMiCo) learning framework that pursues fine-grained spatiotemporal representations from a macro-to-micro perspective. Specifically, MaMiCo constructs a multi-branch architecture of T-MaMiCo and S-MaMiCo on a temporally-nested clip pyramid (video-to-frame). On this pyramid, T-MaMiCo targets temporal correspondence by simultaneously assimilating semantically invariant representations and retaining appearance dynamics over long temporal ranges. For spatial correspondence, S-MaMiCo perceives subtle motion cues by improving dense CSL for videos: stationary clips serve as a stable reference for dense contrasting, alleviating the semantic inconsistency caused by "mismatching". Extensive experiments show that MaMiCo learns rich, general video representations and performs well on various downstream tasks, e.g., (fine-grained) action recognition, action localization, and video retrieval.
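The abstract's central structure is a temporally-nested clip pyramid spanning video, clip, and frame granularities, where each finer level is temporally contained in the coarser one. The following is a minimal illustrative sketch of such a pyramid; the function name, `clip_len` parameter, and the center-crop nesting rule are assumptions for illustration, not the paper's exact sampling recipe.

```python
import numpy as np

def nested_clip_pyramid(video, clip_len=8):
    """Build a temporally-nested pyramid (video -> clip -> frame).

    `video` is a (T, H, W, C) array. The clip and the single frame are
    cropped around the same temporal center, so each finer level is
    contained within the coarser one. This is a hypothetical sketch:
    the real framework's sampling and branch inputs may differ.
    """
    T = video.shape[0]
    center = T // 2
    half = clip_len // 2
    # keep the clip window inside [0, T)
    start = max(0, min(center - half, T - clip_len))
    clip = video[start:start + clip_len]   # clip level, nested in the video
    frame = video[center:center + 1]       # frame level, nested in the clip
    return {"video": video, "clip": clip, "frame": frame}

# toy example: a 32-frame video of 4x4 RGB frames
pyramid = nested_clip_pyramid(np.zeros((32, 4, 4, 3)), clip_len=8)
print({k: v.shape[0] for k, v in pyramid.items()})  # {'video': 32, 'clip': 8, 'frame': 1}
```

The nesting gives the temporal branch aligned positive pairs across granularities (macro-to-micro), while a stationary reference clip for the spatial branch can be drawn from the same pyramid.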