Keywords: Self-Supervised Learning, Video Representation Learning, Vision Transformers
TL;DR: Efficient sparse DINOv2 for videos
Abstract: Self-supervised learning (SSL) has revolutionized image processing, but extending its success to video understanding presents unique challenges due to increased data complexity and computational demands. We introduce ViDROP (Video Dense Representation thrOugh spatio-temporal sParsity), a novel SSL architecture for video understanding that combines token dropping and masking strategies.
Our approach eliminates the need for a decoder and enables per-patch loss computation, overcoming limitations of previous video SSL methods. Moreover, we propose a simple yet effective video compression technique using k-means clustering in pixel space, significantly accelerating data loading and facilitating rapid experimentation. ViDROP demonstrates remarkable scalability across model sizes, from ViT-Small to ViT-Huge, when starting from pretrained models (VideoMAE or V-JEPA), achieving significant performance gains. Pushing the boundaries even further, we leverage network expansion techniques to successfully train ViT-Huge from scratch using modest computational resources, matching VideoMAE's accuracy while training 25$\times$ faster. This marks a significant breakthrough in large-scale video SSL, enabling the training of state-of-the-art models with limited resources.
Extensive experiments show that ViDROP achieves state-of-the-art performance on various video understanding benchmarks, including Kinetics400, SSv2, UCF101, and HMDB51, as well as in temporal action detection (THUMOS14). These results highlight the effectiveness of our fine-grained token-level learning strategy in a domain traditionally dominated by fine-tuned SSL models, while enabling the training of large-scale models with limited computational resources.
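The pixel-space k-means compression mentioned in the abstract can be sketched as follows. This is an illustrative reading of the idea, not the authors' released implementation: the function names, shapes, and parameters below are hypothetical, and a plain NumPy k-means stands in for whatever clustering routine the paper actually uses.

```python
import numpy as np

def kmeans_quantize(frames, k=16, iters=10, seed=0):
    """Quantize video pixels to k centroids via k-means in pixel space.

    frames: uint8 array of shape (T, H, W, C).
    Returns (codes, centroids): codes has shape (T, H, W) with values in
    [0, k), centroids has shape (k, C). Storing codes + centroids is far
    smaller than raw pixels, which speeds up data loading.
    """
    rng = np.random.default_rng(seed)
    pixels = frames.reshape(-1, frames.shape[-1]).astype(np.float32)
    # Initialize centroids from k distinct random pixels.
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Assign every pixel to its nearest centroid (squared L2 distance).
        d = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        codes = d.argmin(1)
        # Move each centroid to the mean of its assigned pixels.
        for j in range(k):
            members = pixels[codes == j]
            if len(members):
                centroids[j] = members.mean(0)
    codes = codes.reshape(frames.shape[:-1]).astype(np.uint8)
    return codes, centroids.astype(np.uint8)

def reconstruct(codes, centroids):
    """Decompress by looking up each code's centroid color."""
    return centroids[codes]
```

For k ≤ 256 the codes fit in a single byte per pixel, so an RGB video shrinks roughly 3× before any entropy coding, at the cost of a lossy palette.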
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3500