Abstract: Recent advancements in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization capabilities for zero-shot tasks. Building on this success, efforts have been made to adapt these image-based visual-language models, such as CLIP, to videos, extending their zero-shot capabilities to the video domain. While these adaptations have shown promising results, they come at a significant computational cost and struggle to effectively model the temporal aspects inherent to the video domain. In this study, we present Efficient Zero-Shot Action Recognition with Temporal Token Learning (T2L), a simple and efficient adaptation of CLIP that addresses these challenges. T2L leverages Temporal Token Learning (TTL) for seamless temporal adaptation, requiring no fundamental changes to the core CLIP architecture while preserving its remarkable generalization abilities. TTL relies on temporal feature diversity (TFD), a novel learning objective that guides TTL to focus on capturing motion, thereby enhancing what it learns from videos. We perform extensive experiments on nine benchmark datasets, thoroughly evaluating T2L for zero-shot and base-to-novel video action recognition, and also demonstrate its potential for few-shot generalization. Impressively, with merely 5.2 million learnable parameters, T2L can be trained efficiently on a single GPU (25x fewer learnable parameters, a 3x reduction in GFLOPs, and a 4x improvement in throughput compared with the prior best model), and it outperforms existing approaches in several evaluations.
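To make the idea of a temporal feature diversity objective concrete, the sketch below shows one plausible form: a penalty on the similarity of adjacent per-frame features, which pushes the model to encode motion rather than repeated static content. This is only an illustrative assumption; the abstract does not specify T2L's exact TFD formulation, and the function name temporal_feature_diversity_loss, the cosine-similarity form, and the weighting term lambda_tfd are hypothetical.

    # Hypothetical sketch of a temporal-feature-diversity style loss (not the paper's exact TFD).
    import torch
    import torch.nn.functional as F

    def temporal_feature_diversity_loss(frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (batch, num_frames, dim) per-frame embeddings from the video encoder."""
        feats = F.normalize(frame_feats, dim=-1)          # unit-normalize each frame feature
        sim = (feats[:, 1:] * feats[:, :-1]).sum(dim=-1)  # cosine similarity of adjacent frames
        return sim.mean()                                 # minimizing this encourages diverse (motion-sensitive) features

    # Illustrative usage: add the diversity term to the usual CLIP-style training loss.
    # total_loss = contrastive_loss + lambda_tfd * temporal_feature_diversity_loss(frame_feats)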
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Efstratios_Gavves1
Submission Number: 3515