Abstract: Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and class descriptions of seen
actions that is transferable to unseen actions. There is a rich diversity in video content, including complex scenes, dynamic human
motions, etc. The text queries (class descriptions) used in existing ZSAR works, however, are often short action names that fail to
capture the rich semantics in the videos, leading to misalignment. Motivated by the intuition that video content descriptions (e.g., video captions) can provide rich contextual information about the visual concepts in videos, and thus help a model understand human actions from the text modality, we propose to utilize human-annotated video descriptions to enrich the semantics of the class descriptions of each action.
However, existing action video description datasets are limited in the number and diversity of actions they cover and in the semantic richness of their descriptions. To this end, we collect a large-scale action video description dataset named ActionHub, which covers 1,211 common actions and provides 3.6 million action video descriptions. With the proposed ActionHub dataset, we
find that the semantics of human actions can be better captured from the textual modality, such that the cross-modality diversity gap
between videos and texts in ZSAR is alleviated, and a transferable alignment is learned for recognizing unseen actions. To achieve this,
we further propose a novel Cross-modality and Cross-action Modeling (CoCo) framework for ZSAR, which consists of a Dual
Cross-modality Alignment module and a Cross-action Invariance Mining module. Specifically, the Dual Cross-modality Alignment
module utilizes both action labels and video descriptions from ActionHub to obtain rich class semantic features for feature alignment.
The Cross-action Invariance Mining module exploits a cycle-reconstruction process between the class semantic feature spaces of seen
actions and unseen actions, aiming to guide the model to learn cross-action invariant representations. Extensive experimental results
demonstrate that our CoCo framework significantly outperforms the state-of-the-art on three popular ZSAR benchmarks (i.e., Kinetics-ZSAR, UCF101, and HMDB51) under two different learning protocols, proving the efficacy of our proposed dataset and method. We will release our code, models, and the proposed ActionHub dataset.
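To make the two modules named above more concrete, the following is a minimal, hypothetical PyTorch sketch of one plausible reading of the abstract: a dual contrastive alignment of video features against two text views (action labels and ActionHub descriptions), and a cycle-reconstruction consistency term between seen- and unseen-class semantic feature spaces. The function names, the attention-based reconstruction, the InfoNCE-style losses, and all hyperparameters (e.g., the temperatures) are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' released code) of the two losses suggested by
# the abstract: dual cross-modality alignment and cross-action cycle reconstruction.
import torch
import torch.nn.functional as F


def cycle_reconstruction_loss(seen_feats, unseen_feats, tau=0.1):
    """Reconstruct seen-class features through the unseen-class space and back.

    seen_feats:   (S, D) class semantic features of seen actions
    unseen_feats: (U, D) class semantic features of unseen actions
    """
    seen = F.normalize(seen_feats, dim=-1)
    unseen = F.normalize(unseen_feats, dim=-1)

    # Step 1: express each seen class as an attention-weighted mixture of unseen classes.
    attn_s2u = F.softmax(seen @ unseen.t() / tau, dim=-1)            # (S, U)
    recon_in_unseen = attn_s2u @ unseen                              # (S, D)

    # Step 2: map the reconstruction back into the seen-class space (the "cycle").
    attn_u2s = F.softmax(recon_in_unseen @ seen.t() / tau, dim=-1)   # (S, S)
    cycled_seen = attn_u2s @ seen                                    # (S, D)

    # Cycle consistency: the round trip should preserve the original seen-class features.
    return F.mse_loss(cycled_seen, seen)


def dual_alignment_loss(video_feats, label_feats, desc_feats, labels, tau=0.07):
    """Contrastively align video features with both text views of their classes.

    video_feats: (B, D) clip features; label_feats / desc_feats: (C, D) per-class text
    features built from action names and video descriptions; labels: (B,) class ids.
    """
    v = F.normalize(video_feats, dim=-1)
    losses = []
    for text in (label_feats, desc_feats):
        t = F.normalize(text, dim=-1)
        logits = v @ t.t() / tau                                     # (B, C)
        losses.append(F.cross_entropy(logits, labels))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy shapes only: 400 seen classes, 160 unseen classes, 512-d features, batch of 8.
    seen = torch.randn(400, 512)
    unseen = torch.randn(160, 512)
    videos = torch.randn(8, 512)
    labels = torch.randint(0, 400, (8,))
    total = dual_alignment_loss(videos, seen, seen.clone(), labels) \
        + cycle_reconstruction_loss(seen, unseen)
    print(total.item())
```

In this reading, the reconstruction attention ties the seen and unseen semantic spaces together, so features that survive the round trip are, by construction, shared across actions rather than specific to any seen class.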