AR-VPT: Simple Auto-Regressive Prompts for Adapting Frozen ViTs to Videos

Published: 01 Jan 2024, Last Modified: 13 May 2024. Venue: VISIGRAPP (2): VISAPP 2024. License: CC BY-SA 4.0
Abstract: The rapid progress of deep learning in image recognition has driven increasing interest in video recognition. While image recognition has benefited from the abundance of pre-trained models, video recognition remains challenging due to the absence of strong pre-trained models and the computational cost of training from scratch. Transfer learning techniques have been used to leverage pre-trained networks for video recognition by extracting features from individual frames and combining them for decision-making. In this paper, we explore the use of Visual-Prompt Tuning (VPT), a computationally efficient technique previously proposed for image recognition, for video recognition. Our contributions are two-fold. First, we introduce the Auto-Regressive Visual Prompt Tuning (AR-VPT) method to perform temporal modeling, addressing the weakness of VPT in this aspect. Second, we achieve significantly improved performance compared to vanilla VPT on three benchmark datasets: UCF-101, Diving-48, and Something