Align before Adapt: Efficient and Generalizable Video Action Recognition with Text Corpus

22 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: Action and event understanding, Vision and language, Multimodal learning
Abstract: Large-scale pre-trained visual-language models have achieved significant success in various video tasks. However, most existing methods follow an 'adapt then align' paradigm, where pre-trained image encoders are adapted to model video-level representations, which are then aligned to the semantics or one-hot labels of target actions. This paradigm overlooks the challenge of mapping from static images to complicated activity concepts. In this paper, we propose a novel and efficient 'align before adapt' paradigm. We introduce a token-merging strategy to the pre-trained image model, generating region-aware embeddings in a hierarchical manner. This enhances the visual-semantic alignment at a fine-grained level. Additionally, we align the region-aware embeddings with the text corpus of action-related entities, such as objects, body parts, primitive motions, and scenes. The embeddings of the aligned text entities serve as queries for the transformer-based video adapter, better aligning with the activity concepts in a video sequence. Our proposed framework achieves competitive performance and superior generalizability while significantly reducing computational costs. In fully-supervised scenarios, our method achieves 87.9% top-1 accuracy on Kinetics-400, using only 4947 GFLOPs. Furthermore, in 2-shot experiments, our method outperforms the previous state-of-the-art by 13.0% and 12.0% on HMDB-51 and UCF-101, respectively.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4446
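
Illustrative sketch (not from the submission): the abstract describes aligning region-aware visual embeddings with a text corpus of action-related entities and using the aligned entities as queries for the video adapter. The snippet below is a minimal sketch of that alignment step only. It assumes cosine similarity, max-pooling over regions, and a top-k cutoff, none of which the abstract confirms. The function names, the toy 3-dimensional vectors, and the entity names are all hypothetical. Token merging and the adapter itself are not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def select_entity_queries(region_embeddings, entity_embeddings, top_k=2):
    """Rank text entities by their best match to any region; keep the top_k names.

    Assumption: an entity's score is its maximum cosine similarity over regions.
    The submission may use a different scoring or selection rule.
    """
    scores = {
        name: max(cosine_similarity(region, text_vec) for region in region_embeddings)
        for name, text_vec in entity_embeddings.items()
    }
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:top_k]


# Toy embeddings standing in for the outputs of an image and a text encoder.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
entities = {
    "guitar": [0.9, 0.1, 0.0],  # hypothetical object entity
    "hand": [0.0, 1.0, 0.0],    # hypothetical body-part entity
    "beach": [0.0, 0.0, 1.0],   # hypothetical scene entity
}

queries = select_entity_queries(regions, entities, top_k=2)
print(queries)  # ['hand', 'guitar']
```

In this toy setup the selected entity names would then be mapped back to their text embeddings and passed to the adapter as queries; how the submission does that is not specified in the abstract.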