Semantic matters: A constrained approach for zero-shot video action recognition

Published: 25 Jan 2025, Last Modified: 05 Mar 2025 · OpenReview Archive Direct Upload · CC BY-NC 4.0
Abstract: Zero-shot video action recognition has advanced significantly thanks to the adaptation of vision-language models, such as CLIP, to video domains. However, existing methods adapt CLIP to video tasks by leveraging temporal information while neglecting the semantic information (i.e., the latent categories and their relationships) within videos. In this paper, we propose a Semantic Constrained CLIP (SC-CLIP) approach that leverages semantic information to adapt CLIP for video recognition while preserving its performance on unseen data. SC-CLIP comprises a semantic-related query generation module and a semantic constrained cross attention module. First, the semantic-related query generation module clusters dense tokens from CLIP to generate semantic-related masks. The semantic-related queries are then derived by pooling the adapted CLIP output with these masks. Next, the semantic constrained cross attention module feeds the generated semantic-related queries back into CLIP to probe semantic-related values, enhancing the model's ability to exploit CLIP's vision-language matching capabilities. By generating semantic-related queries, the semantic information helps distinguish similar actions, thereby improving performance on unseen samples. Experimental results on three zero-shot action recognition benchmarks show improvements of up to 1.9% and 2% in harmonic mean under two settings. Code is available at https://github.com/quanzhenzhen/SC-CLIP.
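To make the described pipeline concrete, below is a minimal, hedged sketch (not the authors' implementation; see the linked repository for the actual code) of the two modules the abstract names: dense CLIP tokens are clustered into semantic-related masks, mask-pooled into semantic-related queries, and those queries attend back over the CLIP tokens in a cross attention step. All function names, tensor shapes, and hyperparameters here (e.g. `num_clusters`, the k-means clustering choice) are illustrative assumptions.

```python
# Illustrative sketch only, assuming PyTorch and hypothetical shapes:
#   dense_tokens: (B, N, D) patch tokens from a (frozen) CLIP visual encoder
#   num_clusters: number of semantic-related masks/queries per sample
import torch
import torch.nn as nn
import torch.nn.functional as F


def semantic_related_queries(dense_tokens: torch.Tensor, num_clusters: int = 8,
                             iters: int = 10) -> torch.Tensor:
    """Cluster dense CLIP tokens (plain k-means here, as an assumption) into
    semantic-related masks, then mask-pool the tokens to obtain
    semantic-related queries of shape (B, K, D)."""
    B, N, D = dense_tokens.shape
    # Initialise centroids from randomly chosen tokens.
    idx = torch.randint(0, N, (B, num_clusters), device=dense_tokens.device)
    centroids = torch.gather(dense_tokens, 1,
                             idx.unsqueeze(-1).expand(-1, -1, D))
    for _ in range(iters):
        # Assign each token to its nearest centroid (hard one-hot mask).
        dists = torch.cdist(dense_tokens, centroids)                 # (B, N, K)
        assign = F.one_hot(dists.argmin(-1), num_clusters).float()   # (B, N, K)
        # Update centroids as the mean of the tokens assigned to them.
        counts = assign.sum(1).clamp(min=1).unsqueeze(-1)            # (B, K, 1)
        centroids = torch.einsum("bnk,bnd->bkd", assign, dense_tokens) / counts
    # Each cluster mean acts as one mask-pooled, semantic-related query.
    return centroids                                                 # (B, K, D)


class SemanticConstrainedCrossAttention(nn.Module):
    """Feed semantic-related queries back over the dense CLIP tokens
    (keys/values) to probe semantic-related values."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, dense_tokens: torch.Tensor):
        out, _ = self.attn(queries, dense_tokens, dense_tokens)
        return out                                                   # (B, K, D)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 512)            # fake CLIP patch tokens
    q = semantic_related_queries(tokens, num_clusters=8)
    values = SemanticConstrainedCrossAttention(512)(q, tokens)
    print(values.shape)                          # torch.Size([2, 8, 512])
```

In this reading, the cluster-derived queries carry category-level structure, so the cross attention output can help separate visually similar actions before vision-language matching; the exact clustering and pooling used by SC-CLIP may differ.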