Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Open-vocabulary video action recognition is a promising venture that aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet they rely on standard, less informative action descriptions and thus falter when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we propose an Action-Centric generation strategy that produces a set of descriptive sentences containing distinctive features for identifying a given action. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism that aligns concepts in video with the textual knowledge encapsulated in the prompts. Extensive experiments on various video benchmarks, covering zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new state-of-the-art performance but also offers excellent interpretability.
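The abstract's recognition pipeline can be pictured as: an LLM generates several descriptive prompts per action name, a text encoder embeds them, and a video embedding is scored against the averaged prompt embeddings. Below is a minimal Python sketch of that idea, not the authors' implementation; the function names (`generate_action_prompts`, `encode_text`, `encode_video`) and the prompt template are hypothetical stand-ins, with random stub encoders so the script runs end to end.

```python
import torch
import torch.nn.functional as F

def generate_action_prompts(action: str, n: int = 4) -> list[str]:
    # Hypothetical stand-in for the Action-Centric LLM generation step.
    # A real system would query an LLM with an action-centric instruction,
    # e.g. "Describe the distinctive visual features of {action}."
    return [f"a video of a person {action}, distinctive cue {i}" for i in range(n)]

def encode_text(prompts: list[str]) -> torch.Tensor:
    # Stub for a CLIP-style text encoder (random embeddings for illustration).
    return F.normalize(torch.randn(len(prompts), 512), dim=-1)

def encode_video(video) -> torch.Tensor:
    # Stub for a temporally-aware video encoder.
    return F.normalize(torch.randn(512), dim=-1)

actions = ["archery", "juggling balls"]  # arbitrary open-vocabulary label set
video = None                             # placeholder video input

# Average the prompt embeddings per action, then classify by cosine similarity.
text_emb = torch.stack(
    [encode_text(generate_action_prompts(a)).mean(dim=0) for a in actions]
)
text_emb = F.normalize(text_emb, dim=-1)
scores = encode_video(video) @ text_emb.T
print("predicted action:", actions[scores.argmax().item()])
```

Because the label set is only consulted at inference time through the generated prompts, new action names can be recognized without retraining, which is the crux of the open-vocabulary setting.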
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion, [Experience] Multimedia Applications
Relevance To Conference: This work advances open-vocabulary video action recognition, a central challenge in multimedia processing. We introduce a novel Action-Centric generation strategy for crafting action-conditioned prompts, enhanced by a multi-modal action knowledge alignment mechanism. This approach applies multimodal techniques directly to the complexities of video action recognition: it tackles an essential problem in the multimedia domain while illustrating how multimodal methods can deepen and refine action recognition capabilities. We believe our method contributes meaningfully to both multimedia and multimodal processing and can inform future research and development in these fields.
Supplementary Material: zip
Submission Number: 1520