Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Open-vocabulary video action recognition is a promising venture that aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet they rely on standard, less informative action descriptions and thus falter when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we propose an Action-Centric generation strategy that produces a set of descriptive sentences containing distinctive features for identifying a given action. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism that aligns concepts in video with the textual knowledge encapsulated in the prompts. Extensive experiments on various video benchmarks, covering zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new state-of-the-art performance but also offers excellent interpretability.
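The abstract's recognition pipeline can be pictured as: an LLM generates several descriptive prompts per action name, a text encoder embeds them, and a video embedding is scored against the averaged prompt embeddings. Below is a minimal Python sketch of that idea, not the authors' implementation; the function names (`generate_action_prompts`, `encode_text`, `encode_video`) and the prompt template are hypothetical stand-ins, with random stub encoders so the script runs end to end.

```python
import torch
import torch.nn.functional as F

def generate_action_prompts(action: str, n: int = 4) -> list[str]:
    # Hypothetical stand-in for the Action-Centric LLM generation step.
    # A real system would query an LLM with an action-centric instruction,
    # e.g. "Describe the distinctive visual features of {action}."
    return [f"a video of a person {action}, distinctive cue {i}" for i in range(n)]

def encode_text(prompts: list[str]) -> torch.Tensor:
    # Stub for a CLIP-style text encoder (random embeddings for illustration).
    return F.normalize(torch.randn(len(prompts), 512), dim=-1)

def encode_video(video) -> torch.Tensor:
    # Stub for a temporally-aware video encoder.
    return F.normalize(torch.randn(512), dim=-1)

actions = ["archery", "juggling balls"]  # arbitrary open-vocabulary label set
video = None                             # placeholder video input

# Average the prompt embeddings per action, then classify by cosine similarity.
text_emb = torch.stack(
    [encode_text(generate_action_prompts(a)).mean(dim=0) for a in actions]
)
text_emb = F.normalize(text_emb, dim=-1)
scores = encode_video(video) @ text_emb.T
print("predicted action:", actions[scores.argmax().item()])
```

Because the label set is only consulted at inference time through the generated prompts, new action names can be recognized without retraining, which is the crux of the open-vocabulary setting.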
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion, [Experience] Multimedia Applications
Relevance To Conference: This work advances open-vocabulary video action recognition, a central challenge in multimedia processing. We introduce a novel Action-Centric generation strategy for crafting action-conditioned prompts, enhanced by a multi-modal action knowledge alignment mechanism. This approach applies multimodal techniques directly to the complexities of video action recognition: it tackles an essential problem in the multimedia domain while illustrating how multimodal methods can deepen and refine action recognition capabilities. We believe our method contributes meaningfully to both multimedia and multimodal processing and can inform future research and development in these fields.
Supplementary Material: zip
Submission Number: 1520