ActionHub: A Large-Scale Action Video Description Dataset for Zero-Shot Action Recognition

Published: 15 Jan 2024 · Last Modified: 27 Jul 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and class descriptions of seen actions that is transferable to unseen actions. There is rich diversity in video content, including complex scenes, dynamic human motions, etc. The text queries (class descriptions) used in existing ZSAR works, however, are often short action names that fail to capture the rich semantics in the videos, leading to misalignment. With the intuition that video content descriptions (e.g., video captions) can provide rich contextual information about the visual concepts in videos, which helps the model understand human actions from the text modality, we propose to utilize human-annotated video descriptions to enrich the semantics of the class descriptions of each action. However, all existing action video description datasets are limited in terms of the number of actions, the diversity of actions, the semantics of video descriptions, etc. To this end, we collect a large-scale action video description dataset named ActionHub, which covers a total of 1,211 common actions and provides 3.6 million action video descriptions. With the proposed ActionHub dataset, we find that the semantics of human actions can be better captured from the textual modality, such that the cross-modality diversity gap between videos and texts in ZSAR is alleviated and a transferable alignment is learned for recognizing unseen actions. To achieve this, we further propose a novel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which consists of a Dual Cross-modality Alignment module and a Cross-action Invariance Mining module. Specifically, the Dual Cross-modality Alignment module utilizes both action labels and video descriptions from ActionHub to obtain rich class semantic features for feature alignment. The Cross-action Invariance Mining module exploits a cycle-reconstruction process between the class semantic feature spaces of seen actions and unseen actions, aiming to guide the model to learn cross-action invariant representations. Extensive experimental results demonstrate that our CoCo framework significantly outperforms the state-of-the-art on three popular ZSAR benchmarks (i.e., Kinetics-ZSAR, UCF101 and HMDB51) under two different learning protocols, proving the efficacy of our proposed dataset and method. We will release our code, models, and the proposed ActionHub dataset.
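
The abstract only names the two CoCo components, so the following is a minimal, speculative PyTorch sketch of the general idea: a dual text-video alignment that combines action-label and description semantics, plus a cycle-style reconstruction between seen and unseen class semantic spaces. All class names, dimensions, temperatures, and loss weights here are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the two CoCo modules described in the abstract.
# Module names, feature dimensions, and loss weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualCrossModalityAlignment(nn.Module):
    """Aligns video features with class semantics built from two text
    sources: action labels and ActionHub-style video descriptions."""

    def __init__(self, video_dim=512, text_dim=512, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.label_proj = nn.Linear(text_dim, embed_dim)
        self.desc_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feat, label_feat, desc_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)        # (B, D)
        t_label = F.normalize(self.label_proj(label_feat), dim=-1)  # (C, D)
        t_desc = F.normalize(self.desc_proj(desc_feat), dim=-1)     # (C, D)
        # Similarity of each video to every class, via both text sources.
        return v @ t_label.t(), v @ t_desc.t(), t_label, t_desc


def cross_action_invariance_loss(seen_sem, unseen_sem):
    """Toy cycle-reconstruction: express seen-class semantics as convex
    combinations of unseen-class semantics and reconstruct them back,
    encouraging representations that transfer across action sets."""
    attn = F.softmax(seen_sem @ unseen_sem.t(), dim=-1)  # seen -> unseen
    recon = attn @ unseen_sem                            # back to seen space
    return F.mse_loss(recon, seen_sem)


if __name__ == "__main__":
    B, C_seen, C_unseen = 8, 400, 100
    model = DualCrossModalityAlignment()
    video = torch.randn(B, 512)
    labels = torch.randn(C_seen, 512)   # e.g. encoded action names
    descs = torch.randn(C_seen, 512)    # e.g. pooled ActionHub descriptions
    targets = torch.randint(0, C_seen, (B,))

    logits_l, logits_d, t_label, _ = model(video, labels, descs)
    align_loss = (F.cross_entropy(logits_l / 0.07, targets)
                  + F.cross_entropy(logits_d / 0.07, targets))

    unseen_sem = F.normalize(torch.randn(C_unseen, 256), dim=-1)
    inv_loss = cross_action_invariance_loss(t_label, unseen_sem)
    total = align_loss + 0.5 * inv_loss
    print(float(total))
```

In this reading, the alignment term plays the role of the Dual Cross-modality Alignment module and the reconstruction term the Cross-action Invariance Mining module; the paper's actual encoders, attention design, and training protocol would come from the released code.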