Full-Dimensional Optimizable Network: A Channel, Frame and Joint-Specific Network Modeling for Skeleton-Based Action Recognition

Published: 01 Jan 2024, Last Modified: 17 Apr 2025, Venue: IJCNN 2024, License: CC BY-SA 4.0
Abstract: Recent human action recognition systems widely adopt graph convolutional networks to extract spatial-temporal movement patterns. In graph convolution layers, inter-joint and inter-frame dependencies govern spatial and temporal feature aggregation and are thus pivotal to representation learning. To enrich the learned motion patterns, a powerful feature extractor should allow flexible information propagation along three dimensions: (1) inferring different inter-joint correlations at different frames; (2) inferring different inter-frame correlations at different joints; (3) inferring different inter-joint and inter-frame correlations at different channels. In this paper, we take a closer look at effective feature aggregation in a skeleton sequence and propose a novel full-dimensional optimizable network with Channel, Frame and Joint-specific Network (CFJ-s Net) modeling to improve action recognition. By promoting dynamic information flows within different channels, frames, and joints, CFJ-s Net extracts significantly richer body posture and trajectory features from a skeleton sequence. As verified on three large-scale datasets, NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA, CFJ-s Net achieves substantial improvements over state-of-the-art methods.
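The three kinds of flexibility listed in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `aggregate_joints` and `aggregate_frames`, the `(C, T, V)` feature layout (channels, frames, joints), and the use of fixed correlation arrays are all assumptions made for illustration. In the paper the correlations are inferred by the network, whereas here they are simply passed in.

```python
import numpy as np

def aggregate_joints(x, adj):
    """Spatial aggregation over joints (hypothetical helper).

    x:   features of shape (C, T, V)
    adj: inter-joint correlations, one of
         (V, V)        shared by all frames and channels,
         (T, V, V)     frame-specific,
         (C, T, V, V)  channel- and frame-specific.
    """
    if adj.ndim == 2:
        return np.einsum("vu,ctu->ctv", adj, x)
    if adj.ndim == 3:
        return np.einsum("tvu,ctu->ctv", adj, x)
    if adj.ndim == 4:
        return np.einsum("ctvu,ctu->ctv", adj, x)
    raise ValueError("adj must have 2, 3, or 4 dimensions")

def aggregate_frames(x, corr):
    """Temporal aggregation over frames (hypothetical helper).

    x:    features of shape (C, T, V)
    corr: inter-frame correlations, one of
          (T, T)        shared by all joints and channels,
          (V, T, T)     joint-specific,
          (C, V, T, T)  channel- and joint-specific.
    """
    if corr.ndim == 2:
        return np.einsum("ts,csv->ctv", corr, x)
    if corr.ndim == 3:
        return np.einsum("vts,csv->ctv", corr, x)
    if corr.ndim == 4:
        return np.einsum("cvts,csv->ctv", corr, x)
    raise ValueError("corr must have 2, 3, or 4 dimensions")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, T, V = 4, 6, 5                        # illustrative sizes only
    x = rng.normal(size=(C, T, V))
    frame_adj = rng.normal(size=(T, V, V))   # a different joint graph per frame
    joint_corr = rng.normal(size=(V, T, T))  # a different frame graph per joint
    y = aggregate_frames(aggregate_joints(x, frame_adj), joint_corr)
    print(y.shape)
```

A shared correlation matrix is the special case in which every frame (or joint, or channel) receives the same matrix, so the specific variants strictly generalize the shared one under these assumptions.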