Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

Published: 01 Jan 2023, Last Modified: 14 May 2025, CVPR 2023, CC BY-SA 4.0
Abstract: We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both input modalities. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results (see https://sites.google.com/view/tubevit).