VLG: General Video Recognition with Web Textual Knowledge

22 Sept 2022 (modified: 25 Nov 2024) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Video Recognition, Multi-Modality, Video-Language Representation Learning
TL;DR: We build a comprehensive video benchmark, Kinetics-GVR, covering closed-set, long-tail, few-shot, and open-set settings, and present a unified video-text framework (VLG) that leverages web textual knowledge to achieve state-of-the-art performance under all of them.
Abstract: Video recognition in an open world is challenging, as it requires handling different settings such as closed-set, long-tail, few-shot, and open-set recognition. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we focus on the general video recognition (GVR) problem of solving these different recognition tasks within a unified framework. The contribution of this paper is twofold. First, we build a comprehensive video recognition benchmark, Kinetics-GVR, consisting of four sub-task datasets that cover the settings above. To facilitate research on GVR, we propose to utilize external textual knowledge from the Internet and provide multi-source text descriptions for all action classes. Second, inspired by the flexibility of language representations, we present a unified visual-linguistic framework (VLG) that solves the GVR problem with an effective two-stage training paradigm. VLG is first pre-trained on video and language datasets to learn a shared feature space, and then a flexible bi-modal attention head is devised to aggregate high-level semantic concepts under the different settings. Extensive experiments show that VLG achieves state-of-the-art performance under all four settings, demonstrating the effectiveness and generalization ability of the proposed framework. We hope our work takes a step towards general video recognition and can serve as a baseline for future research.
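To make the two-stage idea concrete, below is a minimal sketch of what a bi-modal attention head operating in a shared video-text space could look like: video tokens attend over per-class text embeddings (e.g. pooled web descriptions), and class logits are cosine similarities in the shared space. All names, shapes, and the aggregation scheme are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiModalAttentionHead(nn.Module):
    """Hypothetical bi-modal head: video queries attend to class text embeddings."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: video tokens (queries) attend to class text embeddings (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Learnable temperature for the similarity logits (CLIP-style initialization).
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, video_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T, dim) frame-level features from the video encoder
        # text_embeds:  (C, dim) one embedding per class, e.g. pooled web descriptions
        B = video_tokens.size(0)
        text = text_embeds.unsqueeze(0).expand(B, -1, -1)        # (B, C, dim)
        attended, _ = self.cross_attn(video_tokens, text, text)  # (B, T, dim)
        video_feat = self.norm(attended.mean(dim=1))             # temporal average pooling
        v = F.normalize(video_feat, dim=-1)                      # (B, dim)
        t = F.normalize(text_embeds, dim=-1)                     # (C, dim)
        return self.logit_scale.exp() * v @ t.t()                # (B, C) class logits


if __name__ == "__main__":
    head = BiModalAttentionHead(dim=512)
    logits = head(torch.randn(2, 8, 512), torch.randn(400, 512))  # e.g. 400 Kinetics classes
    print(logits.shape)  # torch.Size([2, 400])
```

Because the classifier is defined by text embeddings rather than a fixed weight matrix, the same head can in principle be reused across closed-set, long-tail, few-shot, and open-set settings by swapping or extending the set of class descriptions.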
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/vlg-general-video-recognition-with-web/code)