Abstract: Skeleton-based action recognition (SAR) in videos is an important but challenging task in computer vision. Recent state-of-the-art (SOTA) models for SAR are primarily based on graph convolutional networks (GCNs), which are powerful at extracting spatial information from skeleton data. However, their ability to capture temporal dynamics remains limited. To address this, we propose the G-Dev layer, which leverages path development—a principled and parsimonious representation for sequential data based on Lie group structures—to enhance temporal modeling. By integrating the G-Dev layer, the proposed DevLSTM module summarizes local temporal dynamics, reducing the time dimension while retaining high-frequency information. It can be conveniently applied to any temporal graph data, complementing existing advanced GCN-based models. Our empirical studies on the NTU-60, NTU-120 and Chalearn2013 datasets demonstrate that our proposed GCN-DevLSTM network consistently improves over strong GCN baseline models and achieves competitive performance. \footnote{The camera-ready version will contain a link to the code repository to ensure reproducibility.}
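To make the path-development idea concrete, the following is a minimal illustrative sketch (not the paper's implementation): a 1-D path is developed into the rotation group SO(2) by mapping each increment into the Lie algebra, exponentiating, and multiplying along time. The choice of SO(2), the `weight` scalar, and all function names are assumptions introduced for illustration; the paper's G-Dev layer uses learnable maps into a matrix Lie algebra.

```python
import math

def rot(theta):
    # Matrix exponential of the so(2) element [[0, -theta], [theta, 0]]
    # is the 2x2 rotation by angle theta.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(A, B):
    # 2x2 matrix product.
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def develop(path, weight=1.0):
    # Development of a 1-D path into SO(2): start at the group identity and
    # multiply the exponentials of the (linearly mapped) increments in order.
    Z = [[1.0, 0.0], [0.0, 1.0]]
    for a, b in zip(path, path[1:]):
        Z = matmul(Z, rot(weight * (b - a)))
    return Z

# Time-reparameterization invariance: resampling the same trajectory with
# extra intermediate points leaves the terminal development unchanged.
Z1 = develop([0.0, 1.0, 3.0])
Z2 = develop([0.0, 0.5, 1.0, 2.0, 3.0])
```

Here both `Z1` and `Z2` equal the rotation by the total displacement 3.0, illustrating why development is a parsimonious, sampling-robust summary of a sequential signal.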
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We revised the manuscript substantially based on the reviewer feedback from the previous TMLR submission. First, we improved the clarity and presentation of the paper by revising Sections 2 and 3 to provide more intuitive explanations of path signatures and path development, including the motivation for time-reparameterization invariance. We also unified the terminology throughout the paper (e.g., consistently using “G-Dev layer”) and replaced the non-standard term “dual graph” with the graph-theoretic term “line graph”.
Second, we strengthened the experimental analysis by adding new ablation studies and scaling experiments. In particular, we added Table 5 to further show that removing the G-Dev layer causes a consistent performance drop across different numbers of blocks, with larger gains observed in smaller models. In addition, we added Table 9 in the appendix to analyze the contributions of the G-Dev layer and the line graph under different stream settings, and introduced an additional comparison against CTR-GCN across different parameter scales in Appendix C. We also added a runtime analysis alongside parameter comparisons to better evaluate the trade-off between accuracy and efficiency.
Third, we updated the related work and experimental comparisons to include more recent state-of-the-art methods, including BlockGCN and transformer-based approaches such as SkateFormer. Correspondingly, we revised the claims throughout the manuscript to avoid overstating performance, emphasizing instead that the proposed method achieves competitive performance while consistently improving strong GCN baselines.
Finally, we refined the overall structure and motivation of the paper by placing greater emphasis on both temporal modeling via the G-Dev layer and structural modeling via the line graph. All changes are highlighted in red throughout the manuscript.
Assigned Action Editor: ~Jicong_Fan2
Submission Number: 7834