Abstract: Vision Transformers (ViTs) applied to human activity recognition model spatial structure inadequately through conventional position embeddings, leading to an over-reliance on fixed positional information. This paper proposes Shuffled Positional Embedding (SPE), a mechanism that randomly permutes the positional encoding on each forward pass, reducing the model's dependence on the position embedding and encouraging it to explore intrinsic spatial relationships. While SPE enhances general spatial awareness, it lacks targeted guidance for human-centric modeling. To address this limitation, Local Shuffled Skeleton Position Embedding (LSSPE) is developed, which leverages 2D skeleton data to provide a spatial representation aware of human body structure. LSSPE computes attention weights from the spatial distances between image patches and skeleton keypoints, incorporating joint motion amplitudes into the weighting. To further exploit the skeleton data, a dual-stream architecture is designed that combines TimeSFormer with LSSPE (LSSPE-TimeSFormer) for the RGB stream and SkateFormer for the skeleton stream. The proposed dual-stream model achieves 95.8\% and 98.7\% accuracy on the NTU RGB+D cross-subject and cross-view settings, respectively, establishing the effectiveness of skeleton-aware position embedding for human activity recognition.
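A minimal PyTorch sketch of the SPE idea as stated in the abstract: permute the patch position embeddings on every forward pass. The abstract does not give implementation details, so keeping the class-token slot fixed and shuffling only during training are assumptions here.

```python
import torch

def shuffled_pos_embed(pos_embed: torch.Tensor, training: bool = True) -> torch.Tensor:
    """Randomly permute the patch position embeddings on each forward pass.

    pos_embed: (1, 1 + N, D) -- a class-token embedding followed by N patch
    embeddings, as in a standard ViT. Leaving the class-token slot in place
    and shuffling only at training time are assumptions, not paper details.
    """
    if not training:
        return pos_embed
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    perm = torch.randperm(patch_pe.size(1), device=patch_pe.device)
    return torch.cat([cls_pe, patch_pe[:, perm]], dim=1)
```

The LSSPE weighting could likewise be sketched as a distance-based softmax of each patch over the skeleton keypoints, scaled by per-joint motion amplitude. The exact weighting function, the temperature `tau`, and the helper name are assumptions; the abstract only states that the weights depend on patch-to-keypoint distances and joint motion amplitudes.

```python
def skeleton_position_weights(patch_centers: torch.Tensor,
                              joints: torch.Tensor,
                              motion_amp: torch.Tensor,
                              tau: float = 1.0) -> torch.Tensor:
    """Attention of each image patch over 2D skeleton keypoints.

    patch_centers: (N, 2) pixel coordinates of patch centers
    joints:        (J, 2) 2D skeleton keypoint coordinates
    motion_amp:    (J,)   positive per-joint motion amplitudes
    Returns (N, J) weights proportional to motion_amp * exp(-distance / tau).
    """
    dist = torch.cdist(patch_centers, joints)                # (N, J) Euclidean distances
    logits = -dist / tau + motion_amp.clamp_min(1e-6).log()  # motion amplitude scales the weight
    return logits.softmax(dim=-1)
```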
Submission Number: 115