EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens

16 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Efficient Learning, Masked Modeling, Video Representation Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a new token selection method for video representation learning that finds tokens containing rich motion features and drops uninformative ones during both pre-training and fine-tuning.
Abstract: Masked video autoencoder approaches have demonstrated their potential by significantly outperforming previous self-supervised learning methods in video representation learning. However, their random masking strategies force them to predict uninformative tokens/frames, demanding excessive computation and memory for training (e.g., over 16 nodes with 128 NVIDIA A100 GPUs). To resolve this issue, we exploit the unequal information density among the patches in videos and propose Efficient Masked Video AutoEncoder by Removing REdundant Spatiotemporal Tokens (EVEREST), a new token selection method for video representation learning that finds tokens containing rich motion features and drops uninformative ones during both pre-training and fine-tuning. We further present an information-intensive frame selection strategy that allows the model to focus on informative and causal frames with minimal redundancy. Our method significantly reduces the computation and memory requirements of masked video autoencoders, enabling pre-training and fine-tuning on a single machine with 8 GPUs while achieving comparable performance to computation- and memory-heavy state-of-the-art methods on multiple benchmarks and on the uncurated Ego4D dataset. We hope that our work contributes to reducing the barrier to further research on video understanding.
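To make the core idea concrete, below is a minimal sketch of motion-based token selection, assuming a simple temporal frame-difference as the motion proxy. The function name, patch pooling, and keep ratio are illustrative assumptions for this sketch, not the authors' actual scoring rule or implementation.

```python
import torch

def select_motion_rich_tokens(frames, patch_size=16, keep_ratio=0.25):
    """Illustrative sketch: rank spatiotemporal patches by a crude
    motion proxy (absolute temporal frame difference) and keep the
    top-k indices. Not the paper's actual method.

    frames: (B, T, C, H, W) video clip.
    Returns indices of kept tokens per sample, shape (B, k).
    """
    B, T, C, H, W = frames.shape
    # Motion proxy: absolute difference between consecutive frames.
    diff = (frames[:, 1:] - frames[:, :-1]).abs()         # (B, T-1, C, H, W)
    diff = diff.mean(dim=2)                               # average over channels
    # Pool the difference map over non-overlapping patches.
    patches = (diff.unfold(2, patch_size, patch_size)
                   .unfold(3, patch_size, patch_size))    # (B, T-1, H/p, W/p, p, p)
    scores = patches.mean(dim=(-1, -2))                   # per-patch motion score
    scores = scores.flatten(1)                            # (B, N) token scores
    k = max(1, int(scores.shape[1] * keep_ratio))
    return scores.topk(k, dim=1).indices                  # motion-rich token indices

# Example: an 8-frame 224x224 RGB clip, keeping the top 25% of tokens.
clip = torch.randn(2, 8, 3, 224, 224)
idx = select_motion_rich_tokens(clip)
print(idx.shape)  # torch.Size([2, 343]) out of 7*14*14 = 1372 tokens
```

In such a scheme, only the selected indices are embedded and passed to the encoder, which is where the compute and memory savings described in the abstract would come from.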
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 709