STAR++: Region-aware Conditional Semantics via Interpretable Side Information for Zero-Shot Skeleton Action Recognition

Yang Chen, Jingcai Guo, Miaoge Li, Zhijie Rao, Song Guo

Published: 01 Jan 2026, Last Modified: 11 Mar 2026, IEEE Transactions on Circuits and Systems for Video Technology, CC BY-SA 4.0
Abstract: Zero-shot skeleton action recognition aims to classify novel action categories by transferring skeleton-semantic priors learned from seen categories to unseen ones. However, current methods struggle to distinguish highly similar action categories, primarily due to coarse-grained cross-modal alignment and a non-discriminative representation space. To address these issues, we propose STAR++, a novel framework that aligns skeleton and semantics in a fine-grained and conditional manner. The key idea is to first establish region-level correspondences between body parts and semantic cues, and then use these local alignments to inform a global alignment process. This design is inspired by human visual cognition, which first attends to crucial local details before perceiving the broader scene. Concretely, we refine both skeleton and semantic representations with a dual-prompt attention mechanism driven by the structural decomposition of the human body and side information generated by a large language model (LLM). This encourages skeleton representations to be more compact within each class and semantic embeddings to be more separable across classes, which helps resolve ambiguity between highly similar actions and improves interpretability of how unseen actions are perceived. Furthermore, we construct a region-aware holistic fusion module that aggregates these fine-grained features into a unified, more discriminative holistic representation. Finally, the global alignment is conditioned on region-aware semantic feedback derived from the fine-grained alignment, forming a conditional process that achieves more effective cross-modal alignment. Extensive experiments on four mainstream benchmarks demonstrate that our method achieves state-of-the-art performance in both the zero-shot learning (ZSL) and generalized zero-shot learning (GZSL) settings.
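The abstract describes a pipeline in which region-level skeleton-semantic alignments condition a subsequent global alignment. The snippet below is a minimal, hypothetical sketch of that idea in PyTorch; the layer names, dimensions, and the cross-attention formulation are illustrative assumptions and do not reproduce the authors' actual STAR++ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionConditionedAlignment(nn.Module):
    """Sketch: region-level alignment of body-part features with LLM side
    information, fused into a holistic representation whose global alignment
    is conditioned on region-aware semantic feedback (hypothetical design)."""

    def __init__(self, dim=256, num_regions=5):
        super().__init__()
        # Cross-attention: body-part (region) features attend to side-information tokens.
        self.region_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Fuses refined region features into one holistic skeleton representation.
        self.fusion = nn.Sequential(
            nn.Linear(num_regions * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Projects region-aware semantic feedback used to condition the global alignment.
        self.condition = nn.Linear(dim, dim)

    def forward(self, region_feats, side_info, class_semantics):
        # region_feats:    (B, R, D) per-body-part skeleton features
        # side_info:       (B, T, D) LLM-generated side-information embeddings
        # class_semantics: (C, D)    global class-level semantic embeddings
        refined, _ = self.region_attn(region_feats, side_info, side_info)  # fine-grained alignment
        holistic = self.fusion(refined.flatten(1))                         # region-aware holistic fusion
        feedback = self.condition(refined.mean(dim=1))                     # region-aware semantic feedback
        conditioned = F.normalize(holistic + feedback, dim=-1)             # conditioned global representation
        logits = conditioned @ F.normalize(class_semantics, dim=-1).T      # global cross-modal alignment
        return logits


# Toy usage: 2 samples, 5 body regions, 8 side-information tokens, 10 classes.
if __name__ == "__main__":
    model = RegionConditionedAlignment(dim=256, num_regions=5)
    logits = model(torch.randn(2, 5, 256), torch.randn(2, 8, 256), torch.randn(10, 256))
    print(logits.shape)  # torch.Size([2, 10])
```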