Description Attribute-Enhanced Spatio-Temporal Zero-Shot Action Recognition

Published: 01 Jan 2024 · Last Modified: 28 Jul 2025 · ICPRAI (2) 2024 · CC BY-SA 4.0
Abstract: Vision-Language Models (VLMs) have shown remarkable performance in zero-shot action recognition by learning the correlation between video embeddings and class embeddings. However, an issue arises when relying solely on action classes for semantic information, due to multi-semantic words, i.e., words with multiple meanings. These words make it difficult for the model to accurately capture the intended concepts of actions. We propose a novel approach that leverages web-crawled descriptions and employs a large language model to extract keywords. This method reduces the reliance on human annotators and avoids the exhaustive manual process of attribute data creation. Moreover, we introduce a spatio-temporal interaction module that focuses on objects and action units to align description attributes with video content. In zero-shot experiments, our model achieves \(81.0\%\), \(53.1\%\), and \(68.9\%\) on UCF-101, HMDB-51, and Kinetics-600, respectively, demonstrating the transferability of our model to downstream tasks.
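To make the general idea concrete, the sketch below illustrates attribute-enhanced zero-shot scoring in the VLM setting the abstract describes: each class embedding is built from the class name plus LLM-extracted keyword attributes, and a clip is classified by cosine similarity against those embeddings. This is a minimal sketch under stated assumptions, not the paper's implementation: the prompt template, the averaging scheme, the keyword lists, and all function names (`class_embeddings`, `zero_shot_predict`, the stub encoders) are illustrative, and the paper's spatio-temporal interaction module over objects and action units is not modeled here.

```python
# Minimal sketch of attribute-enhanced zero-shot action classification.
# Assumes CLIP-style text/video encoders producing same-dimension features.
import torch
import torch.nn.functional as F


def class_embeddings(text_encoder, keywords: dict[str, list[str]]) -> torch.Tensor:
    """Build one embedding per class by averaging embeddings of the class
    name and its keyword attributes (hypothetical prompt template)."""
    embs = []
    for name, words in keywords.items():
        prompts = [f"a video of {name}"] + [
            f"a video of {name}, showing {w}" for w in words
        ]
        e = F.normalize(text_encoder(prompts), dim=-1)      # (P, D)
        embs.append(F.normalize(e.mean(dim=0), dim=-1))     # pooled class vector
    return torch.stack(embs)                                # (C, D)


@torch.no_grad()
def zero_shot_predict(video_encoder, text_encoder, frames, keywords):
    """Score one clip against attribute-enhanced class embeddings
    via cosine similarity."""
    v = F.normalize(video_encoder(frames), dim=-1)          # (D,) clip embedding
    t = class_embeddings(text_encoder, keywords)            # (C, D)
    return (v @ t.T).softmax(dim=-1)                        # class probabilities


if __name__ == "__main__":
    # Random stand-ins for a pretrained VLM's encoders, just to run the sketch.
    D = 512
    text_encoder = lambda prompts: torch.randn(len(prompts), D)
    video_encoder = lambda frames: torch.randn(D)
    keywords = {  # imitating LLM-extracted attributes from web descriptions
        "archery": ["bow", "arrow", "target"],
        "bowling": ["ball", "lane", "pins"],
    }
    print(zero_shot_predict(video_encoder, text_encoder,
                            torch.zeros(8, 3, 224, 224), keywords))
```

Averaging the name prompt with keyword-augmented prompts is one simple way to fold description attributes into a class embedding; it serves only to show where the extracted keywords enter the pipeline.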