VLG: General Video Recognition with Web Textual Knowledge

Jintao Lin; Zhaoyang Liu; Wenhai Wang; Wayne Wu; Limin Wang

22 Sept 2022 (modified: 15 Jan 2026); ICLR 2023 Conference Withdrawn Submission; Readers: Everyone

Keywords: Video Recognition, Multi Modality, Video-language representation learning

TL;DR: We build Kinetics-GVR, a comprehensive video benchmark covering the close-set, long-tail, few-shot and open-set settings, and present a unified video-text framework (VLG) that uses web textual knowledge to achieve SOTA performance under all of them.

Abstract: Video recognition in an open world is challenging because a model must handle different settings such as close-set, long-tail, few-shot and open-set. We focus on the general video recognition (GVR) problem: solving these different recognition tasks within a unified framework by leveraging semantic knowledge from noisy text descriptions crawled from the Internet. The contribution of this paper is twofold. First, we build Kinetics-GVR, a comprehensive video recognition benchmark comprising four sub-task datasets that cover the settings above. To facilitate research on GVR, we propose to utilize external textual knowledge from the Internet and provide multi-source text descriptions for all action classes. Second, inspired by the flexibility of language representation, we present a unified visual-linguistic framework (VLG) that addresses GVR with an effective two-stage training paradigm. VLG is first pre-trained on video and language datasets to learn a shared feature space; a flexible bi-modal attention head is then devised to integrate high-level semantic concepts under the different settings. Extensive results show that VLG obtains state-of-the-art performance under all four settings, which demonstrates the effectiveness and generalization ability of the proposed framework. We hope our work is a step towards general video recognition and can serve as a baseline for future research.
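The abstract's idea of a shared video-text feature space can be illustrated with a minimal sketch. It is not the VLG implementation: it leaves out the bi-modal attention head and both training stages. It only assumes that a video and each class's text description have already been embedded into the same space, and then classifies by cosine similarity. The function names, the toy embeddings, the `temperature` value and the `reject_below` threshold (a simple stand-in for open-set rejection) are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def classify_by_text_similarity(video_emb, class_text_embs,
                                temperature=0.07, reject_below=None):
    """Score one video embedding against per-class text embeddings.

    Returns (label, probs). `label` is None when `reject_below` is set and
    the best cosine similarity is under that threshold (hypothetical
    open-set rejection rule, not taken from the paper).
    """
    video = l2_normalize(video_emb)
    names = list(class_text_embs)
    sims = [
        sum(a * b for a, b in zip(video, l2_normalize(class_text_embs[name])))
        for name in names
    ]

    # Temperature-scaled softmax over similarities, shifted for stability.
    logits = [s / temperature for s in sims]
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(names, exps)}

    best = max(range(len(names)), key=lambda i: sims[i])
    if reject_below is not None and sims[best] < reject_below:
        return None, probs
    return names[best], probs


# Toy embeddings standing in for encoder outputs (made up for illustration).
class_text_embs = {
    "playing guitar": [0.9, 0.1, 0.0],
    "surfing": [0.0, 0.2, 0.9],
}
label, probs = classify_by_text_similarity([0.8, 0.2, 0.1], class_text_embs)
unknown, _ = classify_by_text_similarity(
    [0.0, 1.0, 0.0], class_text_embs, reject_below=0.5
)
```

Because classes are represented by text embeddings rather than fixed classifier weights, adding a class in this sketch only requires adding one entry to `class_text_embs`, which is the kind of flexibility the abstract attributes to language representation.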

Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/vlg-general-video-recognition-with-web/code)
