Abstract: We present a simple approach that turns a ViT encoder into an efficient video model that works seamlessly with both image and video inputs. By sparsely sampling the inputs, the model can train and run inference on both input modalities. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results (see https://sites.google.com/view/tubevit).