Spatio-Temporal Dual Attention with Cross-Sensor Attention for Enhanced IMU-based Human Activity Recognition

Published: 22 Sept 2025, Last Modified: 22 Sept 2025 · WiML @ NeurIPS 2025 · CC BY 4.0
Keywords: Human Activity Recognition, IMU Data, Cross-Sensor Attention, Transformers
Abstract: Human Activity Recognition (HAR) from Inertial Measurement Unit (IMU) sensor data is a critical research domain with applications in healthcare monitoring, surveillance, and sports analytics. Despite significant advances in deep learning architectures, existing approaches struggle to capture the complex spatiotemporal dependencies inherent in human activities and fail to exploit coordinated movements across multiple body segments. This work presents the Spatio-Temporal Dual Attention Transformer with Cross-Sensor Attention (STDual-X), a novel framework for HAR. STDual-X processes the readings of each body-worn IMU with a Spatio-Temporal Dual Attention Transformer (STDAT) [1] to produce a vector that encodes that sensor's temporal and spatial movement patterns. The resulting per-sensor vectors are then passed to the Cross-Sensor Attention (CSA) module, which applies cross-attention to model inter-sensor relationships and generates Cross-Sensor Relation (CSR) vectors that encode dependencies between sensors. The final embeddings, enriched with cross-sensor information, are fed to a fully connected neural network with a softmax layer to classify human activities. The model is optimized with a composite loss function that combines Cross-Entropy Loss for accurate activity classification with a triplet-based Contrastive Loss that enhances the discriminative quality of the feature embeddings.

Extensive experiments on three benchmark datasets demonstrate the superior performance of STDual-X, which achieves state-of-the-art accuracies of 97.02% on PAMAP2 [2], 95.43% on Opportunity [3], and 98.80% on UCI-HAR [4]. Comprehensive ablation studies validate the contribution of each architectural component: removing STDAT reduces accuracy by 7.74%, eliminating CSA decreases performance by 2.34%, and excluding the contrastive loss lowers accuracy by 4.64%.

Quantitative analysis using t-SNE visualization demonstrates superior feature separability, with a Silhouette score of 0.5826 that significantly outperforms baseline approaches. Cross-sensor attention visualization reveals meaningful inter-sensor relationships that align with biomechanical principles of human movement, including high hand-ankle coordination during walking and uniform attention distribution during static postures. The framework is robust under adverse conditions, maintaining 96.78% accuracy with 10% missing data and degrading gracefully under noise injection. Computational efficiency analysis shows an inference time of 1.6 milliseconds per sample with 144.72 MB peak memory usage on NVIDIA Tesla P100 hardware, indicating suitability for edge deployment scenarios. The proposed STDual-X framework thus advances the state-of-the-art in IMU-based HAR by effectively addressing spatiotemporal feature extraction challenges and establishing a new paradigm for IMU-based activity recognition systems. These contributions have significant implications for real-world applications requiring accurate, efficient, and privacy-preserving activity monitoring capabilities.
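To make the pipeline concrete, the following PyTorch-style sketch illustrates how per-sensor STDAT embeddings could be fused by a cross-sensor attention module and trained with a composite cross-entropy plus triplet loss, as described above. This is a minimal illustration under stated assumptions, not the authors' implementation: the number of attention heads, the residual connection with layer normalization, the flattened-embedding classifier head, the triplet mining strategy, and the weighting factor `alpha` are all assumptions not specified in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossSensorAttention(nn.Module):
    """Hypothetical CSA module: each sensor embedding attends to all others."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, sensor_embs: torch.Tensor) -> torch.Tensor:
        # sensor_embs: (batch, num_sensors, dim), one STDAT vector per IMU.
        # Cross-attention across the sensor axis yields Cross-Sensor
        # Relation (CSR) vectors encoding inter-sensor dependencies.
        csr, _ = self.attn(sensor_embs, sensor_embs, sensor_embs)
        # Residual connection + layer norm (an assumption, not stated).
        return self.norm(sensor_embs + csr)


class STDualXHead(nn.Module):
    """Classifier head: CSA followed by a fully connected softmax layer."""

    def __init__(self, dim: int, num_sensors: int, num_classes: int):
        super().__init__()
        self.csa = CrossSensorAttention(dim)
        self.fc = nn.Linear(dim * num_sensors, num_classes)

    def forward(self, sensor_embs: torch.Tensor):
        csr = self.csa(sensor_embs)          # (batch, num_sensors, dim)
        emb = csr.flatten(start_dim=1)       # fused cross-sensor embedding
        return self.fc(emb), emb             # logits (softmax via loss), embedding


def composite_loss(logits, labels, anchor, positive, negative,
                   alpha: float = 1.0, margin: float = 1.0):
    # Cross-entropy drives classification accuracy; the triplet-based
    # contrastive term sharpens embedding separability. The anchor /
    # positive / negative embeddings would come from in-batch mining,
    # and alpha balances the two terms (both assumed here).
    ce = F.cross_entropy(logits, labels)
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return ce + alpha * triplet


# Illustrative usage: 4 IMUs, 64-dim STDAT embeddings, 12 activity classes.
head = STDualXHead(dim=64, num_sensors=4, num_classes=12)
logits, emb = head(torch.randn(8, 4, 64))
```

Fusing the CSR vectors by flattening before the fully connected layer is one plausible reading of "final embeddings enriched with cross-sensor information"; pooling across sensors would be an equally valid design choice.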
References
[1] D. Senarath, S. Tharinda, M. Vishvajith, S. Rasnayaka, S. Wickramanayake, and D. Meedeniya, "BehaveFormer: A framework with spatio-temporal dual attention transformers for IMU-enhanced keystroke dynamics," in 2023 IEEE International Joint Conference on Biometrics (IJCB), 2023, pp. 1–9.
[2] A. Reiss, "PAMAP2 Physical Activity Monitoring," UCI Machine Learning Repository, 2012.
[3] D. Roggen, A. Calatroni, M. Rossi, T. Holleczek, K. Förster, G. Tröster, P. Lukowicz, D. Bannach, G. Pirkl, A. Ferscha, J. Doppler, C. Holzmann, M. Kurz, G. Holl, R. Chavarriaga, H. Sagha, H. Bayati, M. Creatura, and J. del R. Millán, "Collecting complex activity datasets in highly rich networked sensor environments," in Seventh International Conference on Networked Sensing Systems (INSS), 2010, pp. 233–240.
[4] J. Reyes-Ortiz, D. Anguita, A. Ghio, L. Oneto, and X. Parra, "Human Activity Recognition Using Smartphones," UCI Machine Learning Repository, 2013.
Submission Number: 109