CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition

Yusen Peng; Alper Yilmaz

15 Sept 2025 (modified: 18 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: Action Recognition, Vision Transformers

Abstract: Skeleton-based human action recognition leverages sequences of human joint coordinates to identify actions performed in videos. Owing to the intrinsic spatiotemporal structure of skeleton data, Graph Convolutional Networks (GCNs) have been the dominant architecture in this field. However, recent advances in transformer models and masked pretraining frameworks open new avenues for representation learning. In this work, we propose CascadeFormer, a family of two-stage cascading transformers for skeleton-based human action recognition. Our framework consists of a masked pretraining stage to learn generalizable skeleton representations, followed by a cascading fine-tuning stage tailored for discriminative action classification. We evaluate CascadeFormer across three benchmark datasets, Penn Action, N-UCLA, and NTU RGB+D 60, achieving competitive performance on all tasks. To promote reproducibility, we will release our code and model checkpoints.
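The masked pretraining stage mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `mask_joints` and `reconstruction_loss`, the 30% mask ratio, zero-filling of masked joints, masking the same joints in every frame, and the mean-squared-error objective are all assumptions made for illustration. The idea shown is only the general one: hide some joint coordinates, then score a model's reconstruction on the hidden joints alone.

```python
import random


def mask_joints(frames, mask_ratio=0.3, seed=0):
    """Zero out a random subset of joints, the same subset in every frame.

    frames: list of frames, each a list of (x, y) joint coordinates.
    Returns (masked_frames, masked_indices).
    The mask ratio and zero-fill strategy are illustrative assumptions.
    """
    rng = random.Random(seed)
    num_joints = len(frames[0])
    num_masked = max(1, int(round(mask_ratio * num_joints)))
    masked_indices = sorted(rng.sample(range(num_joints), num_masked))
    masked_frames = [
        [(0.0, 0.0) if j in masked_indices else joint
         for j, joint in enumerate(frame)]
        for frame in frames
    ]
    return masked_frames, masked_indices


def reconstruction_loss(predicted, target, masked_indices):
    """Mean squared error computed only over the masked joints."""
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(predicted, target):
        for j in masked_indices:
            px, py = pred_frame[j]
            tx, ty = true_frame[j]
            total += (px - tx) ** 2 + (py - ty) ** 2
            count += 1
    return total / count


# Toy sequence: 4 frames of 13 two-dimensional joints (synthetic values).
frames = [[(float(t + j), float(t - j)) for j in range(13)] for t in range(4)]
masked, idx = mask_joints(frames)

# A perfect reconstruction has zero loss; the unfilled masked input does not.
perfect = reconstruction_loss(frames, frames, idx)
untrained = reconstruction_loss(masked, frames, idx)
```

In a real pipeline, `predicted` would be the output of the transformer encoder and decoder applied to `masked`, and the pretrained encoder would then be reused in the fine-tuning stage for action classification.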

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 6391
