KDHiera: boosting self-supervised masked video modeling via hierarchical knowledge distillation

Published: 2025 · Last Modified: 04 Nov 2025 · Clust. Comput. 2025 · CC BY-SA 4.0
Abstract: Thanks to advances in masked visual modeling, significant progress has been made in self-supervised video representation learning. However, existing studies primarily focus on learning representations from scratch by reconstructing low-level visual features, such as raw pixel values or HOG, using a vanilla ViT backbone. In this paper, we propose a novel structure for masked video modeling pre-training with knowledge distillation using a hierarchical vision transformer (KDHiera). Specifically, we leverage off-the-shelf self-supervised pre-trained image and video teachers to learn both spatial and temporal representations via hierarchical feature distillation. Following common practice, we reconstruct only the unmasked features of the teachers. For the masking process, we use the output of the image teacher to generate an attention mask that preserves more semantic information while adopting a high masking ratio. However, we observe that existing masked video modeling works using knowledge distillation tend to learn only from the teacher's last-layer features, which may be neither sufficient nor effective. Motivated by this observation, we introduce multi-layer teacher feature distillation, which helps the student model capture information across the teacher's full learning process and efficiently enhances its high-level feature extraction capability. Although the vanilla ViT's isotropy keeps pre-training simple, its quadratic complexity poses challenges for training and inference efficiency. To address this, we adopt a hierarchical ViT structure as the KDHiera backbone, reducing computational cost and accelerating model convergence.
Although the large structural difference between the teacher and student models makes knowledge distillation challenging, our KDHiera, equipped with multi-layer feature learning from image and video teachers, exhibits remarkable data efficiency and achieves competitive performance compared with previous supervised and self-supervised methods on various challenging video downstream tasks.
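The two mechanisms named in the abstract, attention-guided masking driven by the image teacher and multi-layer feature distillation on unmasked tokens, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the CLS-attention selection rule, and the plain per-layer MSE objective are assumptions made for the sketch.

```python
import numpy as np

def attention_guided_mask(cls_attention, mask_ratio=0.9):
    """Keep the tokens the image teacher attends to most (a proxy for
    semantic content) and mask the rest. `cls_attention` is a per-token
    attention score vector; True in the returned array means masked."""
    n = cls_attention.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep_idx = np.argsort(cls_attention)[::-1][:n_keep]  # highest scores first
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return mask

def multi_layer_distill_loss(student_feats, teacher_feats, visible_idx):
    """Average MSE between student and teacher features at several layers,
    computed only on visible (unmasked) token positions, matching the
    choice to reconstruct only unmasked teacher features."""
    loss = 0.0
    for s, t in zip(student_feats, teacher_feats):
        diff = s[visible_idx] - t[visible_idx]
        loss += np.mean(diff ** 2)
    return loss / len(student_feats)

# Toy usage: 16 tokens, 75% masked, 3 distilled layers of width 8.
rng = np.random.default_rng(0)
attn = rng.random(16)
mask = attention_guided_mask(attn, mask_ratio=0.75)
visible = np.where(~mask)[0]
teacher = [rng.standard_normal((16, 8)) for _ in range(3)]
student = [t + 0.1 * rng.standard_normal((16, 8)) for t in teacher]
loss = multi_layer_distill_loss(student, teacher, visible)
```

In the paper's setting the distilled layers would come from hierarchical stages of the teachers rather than a flat list, and the student operates on video tokens; the sketch above only conveys the loss structure.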