Multimodal Open-Vocabulary Video Classification via Vision and Language Models

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: open-vocabulary, multimodal, video, optical flow, audio
TL;DR: We propose a method for open-vocabulary video classification that leverages pre-trained vision and language models together with multimodal signals such as optical flow and audio to improve performance.
Abstract: Utilizing vision and language models (VLMs) pre-trained on internet-scale image-text pairs is becoming a promising paradigm for open-vocabulary vision tasks. This work conducts an extensive study of multimodal open-vocabulary video classification via pre-trained VLMs, leveraging the motion and audio that naturally exist in videos. We design an asymmetric cross-modal fusion mechanism that aggregates multimodal information differently for video than for optical flow and audio. Experiments on Kinetics and VGGSound show that introducing more modalities significantly improves accuracy on seen classes while generalizing better to unseen classes than existing approaches. Despite its simplicity, our method achieves state-of-the-art results on the UCF and HMDB zero-shot video action recognition benchmarks, significantly outperforming traditional zero-shot techniques, video-text pre-training methods, and recent VLM-based approaches. Code and models will be released.
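
The abstract names two ingredients: matching video embeddings against VLM text embeddings of class names (which is what makes the classification open-vocabulary), and an asymmetric cross-modal fusion that treats the video stream differently from the flow/audio stream. Since the page gives no architectural details, the sketch below is purely illustrative: the module name `AsymmetricFusion`, the choice of cross-attention for the video branch versus a lightweight projection for the auxiliary branch, and the cosine-similarity scoring are all assumptions, not the authors' method.

```python
# Hypothetical sketch of asymmetric cross-modal fusion plus open-vocabulary
# scoring. Design choices here are assumptions; the paper's actual mechanism
# is not specified on this page.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Asymmetry (assumed): video tokens attend to the auxiliary modality
        # (flow or audio) via full cross-attention ...
        self.video_from_aux = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ... while the auxiliary branch only receives a pooled, linearly
        # projected summary of the video stream.
        self.aux_from_video = nn.Linear(dim, dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, aux: torch.Tensor):
        # video: (B, Tv, D) frame tokens; aux: (B, Ta, D) flow/audio tokens
        fused_v, _ = self.video_from_aux(query=video, key=aux, value=aux)
        video = self.norm_v(video + fused_v)
        summary = video.mean(dim=1, keepdim=True)          # (B, 1, D)
        aux = self.norm_a(aux + self.aux_from_video(summary))
        return video, aux

def open_vocab_scores(video_tokens: torch.Tensor, text_emb: torch.Tensor):
    # Pool fused video tokens and score against class-name text embeddings
    # (e.g., from a CLIP-style text encoder) by cosine similarity.
    v = F.normalize(video_tokens.mean(dim=1), dim=-1)      # (B, D)
    t = F.normalize(text_emb, dim=-1)                      # (C, D)
    return v @ t.T                                         # (B, C) logits

if __name__ == "__main__":
    block = AsymmetricFusion(dim=512)
    v = torch.randn(2, 16, 512)    # stand-in for VLM frame embeddings
    a = torch.randn(2, 10, 512)    # stand-in for flow/audio embeddings
    t = torch.randn(400, 512)      # stand-in for class-name text embeddings
    v_out, _ = block(v, a)
    print(open_vocab_scores(v_out, t).shape)  # torch.Size([2, 400])
```

Because the class set enters only through the text embeddings, unseen classes can be scored at inference time by embedding their names, which is what lets such a model report zero-shot results on UCF and HMDB.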
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)