Decoupled spatio-temporal grouping transformer for skeleton-based action recognition

Published: 2024, Last Modified: 14 Jan 2026, Venue: Vis. Comput. 2024, License: CC BY-SA 4.0
Abstract: Capturing correlations between joints is crucial in skeleton-based action recognition. Transformers have demonstrated their ability to capture such correlations. However, conventional Transformer-based approaches model the relationships between joints in a unified spatio-temporal dimension, disregarding the distinct semantic information carried by the spatial and temporal dimensions of skeleton sequences. To address this issue, we present a novel decoupled spatio-temporal grouping Transformer (DSTGFormer) model. The skeleton sequence is split into multiple spatio-temporal groups, each containing a set of consecutive frames. The spatio-temporal positional encoding (STPE) module assigns identity information to each element in the sequence. The spatio-temporal grouping self-attention (STGA) module captures the spatial and temporal relationships between different joints within a spatio-temporal group. Decoupling the spatial and temporal dimensions in this way enables the extraction of the different semantic information present in each dimension. Additionally, we propose a within-group spatial global regularization mechanism to learn more general spatial attention maps, and an inter-group feature aggregation (IGFA) module to better distinguish between similar actions. Our proposed method outperforms state-of-the-art methods on two large-scale datasets in terms of both recognition accuracy and computational efficiency.
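The grouping and decoupling idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `(T, V, C)` tensor layout (frames, joints, channels), the absence of learned query/key/value projections, and the choice to apply spatial attention first and temporal attention second are all assumptions made for illustration. STPE, the regularization mechanism, and IGFA are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    Assumption: no learned projections; queries, keys and values are all x.
    Leading axes are treated as batch dimensions.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def grouped_decoupled_attention(seq, group_size):
    """Hypothetical sketch of grouped, decoupled spatio-temporal attention.

    seq: array of shape (T, V, C) = (frames, joints, channels).
    group_size: number of consecutive frames per spatio-temporal group.
    """
    T, V, C = seq.shape
    if T % group_size != 0:
        raise ValueError("T must be divisible by group_size")

    # Split the sequence into groups of consecutive frames: (N, G, V, C).
    groups = seq.reshape(T // group_size, group_size, V, C)

    # Spatial step: joints attend to joints within the same frame.
    spatial = self_attention(groups)

    # Temporal step: each joint attends across the frames of its own group.
    per_joint = np.swapaxes(spatial, 1, 2)   # (N, V, G, C)
    temporal = self_attention(per_joint)

    # Restore the original (T, V, C) layout.
    return np.swapaxes(temporal, 1, 2).reshape(T, V, C)


rng = np.random.default_rng(0)
seq = rng.normal(size=(8, 5, 4))             # 8 frames, 5 joints, 4 channels
out = grouped_decoupled_attention(seq, group_size=4)
print(out.shape)
```

Because attention is confined to each group, changing the frames of one group leaves the outputs of the other groups untouched. In this sketch that locality is what keeps the attention matrices small (V×V and G×G rather than TV×TV); how the actual model exchanges information across groups is the role the abstract assigns to the IGFA module.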