Abstract: Vision-Language Models (VLMs) have shown remarkable performance in zero-shot action recognition by learning the correlation between video embeddings and class embeddings. However, relying solely on action class names for semantic information is problematic when those names contain multi-semantic words, i.e., words with multiple meanings. Such words make it difficult for the model to accurately capture the intended concepts of actions. We propose a novel approach that leverages web-crawled descriptions and uses a large language model to extract keywords from them. This method reduces the reliance on human annotators and avoids the exhaustive manual process of creating attribute data. Moreover, we introduce a spatio-temporal interaction module that focuses on objects and action units to align description attributes with video content. In zero-shot experiments, our model achieves \(81.0\%\), \(53.1\%\), and \(68.9\%\) on UCF-101, HMDB-51, and Kinetics-600, respectively, demonstrating its transferability to downstream tasks.
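As a rough illustration of the keyword-extraction step summarized above, the sketch below shows how one might prompt an LLM to pull action-relevant keywords from a web-crawled description. The `query_llm` helper, the prompt wording, and the keyword count are assumptions for illustration only, not the paper's actual pipeline.

```python
# Hypothetical sketch of LLM-based keyword extraction from web-crawled
# action descriptions (not the paper's actual implementation).

def query_llm(prompt: str) -> str:
    """Placeholder for a call to any large language model.
    Replace with a real client (hosted API or local model)."""
    raise NotImplementedError("plug in an LLM client here")

def extract_keywords(action_class: str, description: str, k: int = 5) -> list[str]:
    """Ask the LLM for k keywords (objects and action units) that
    disambiguate the action class given its crawled description."""
    prompt = (
        f"Action class: {action_class}\n"
        f"Web description: {description}\n"
        f"List the {k} most informative keywords (objects and action units) "
        f"that capture the intended meaning of this action, comma-separated."
    )
    response = query_llm(prompt)
    return [w.strip() for w in response.split(",") if w.strip()]

# Example usage (requires a real query_llm implementation):
# extract_keywords("pitching", "A baseball pitcher throws the ball toward home plate ...")
```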