Efficient VideoMAE via Temporal Progressive Training

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Masked autoencoder, Video transformer, Training efficiency
TL;DR: A temporal progressive (TP) training framework reduces VideoMAE training cost by up to 3x with better performance.
Abstract: Masked autoencoders (MAE) have recently been adapted for video recognition, setting new performance benchmarks. Nonetheless, the computational overhead of training VideoMAE remains a prominent challenge, often demanding extensive GPU resources and days of training. To improve the training efficiency of VideoMAE, this paper presents Temporal Progressive Training (TPT), a simple strategy that introduces longer video clips step by step over the course of training. Specifically, TPT decomposes the intricate task of long-clip reconstruction into a series of sub-tasks, progressively transitioning from short video clips to long ones. Extensive experiments verify the efficacy and efficiency of TPT. For example, TPT reduces training cost by factors of $2\times$ on Kinetics-400 and $3\times$ on Something-Something V2 while matching the performance of VideoMAE. Moreover, TPT consistently outperforms VideoMAE when both are trained with the same budget.
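To make the schedule described in the abstract concrete, below is a minimal Python sketch of a temporal progressive clip-length schedule: the pretraining epoch budget is split into stages, and each stage feeds the masked autoencoder progressively longer clips. The stage boundaries, clip lengths, and the `tpt_clip_length` helper are hypothetical illustrations assumed for this sketch, not the authors' exact configuration.

```python
def tpt_clip_length(epoch: int, total_epochs: int,
                    stages=(4, 8, 16)) -> int:
    """Return the clip length (in frames) to train on at the given epoch.

    The epoch budget is split evenly across stages; each stage runs MAE
    pretraining on progressively longer clips (short -> long).
    NOTE: the stage split and frame counts here are illustrative guesses.
    """
    stage_len = total_epochs / len(stages)
    stage_idx = min(int(epoch // stage_len), len(stages) - 1)
    return stages[stage_idx]


if __name__ == "__main__":
    total = 90
    for epoch in range(total):
        num_frames = tpt_clip_length(epoch, total)
        # Sample clips of `num_frames` frames and run one epoch of masked
        # reconstruction here (dataloader and model omitted for brevity).
        if epoch % 30 == 0:
            print(f"epoch {epoch}: training on {num_frames}-frame clips")
```

Under this assumed three-stage split, a 90-epoch run would spend epochs 0-29 on 4-frame clips, 30-59 on 8-frame clips, and 60-89 on 16-frame clips, so the expensive long-clip reconstruction is only paid for in the final stage.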
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6496