Gesture recognition with adaptive-weight residual multi-head cross-attention fusion of multi-level feature information
Abstract: Deep learning-based gesture recognition has attracted considerable attention owing to its vast potential across numerous applications. Current approaches predominantly process a single modality of sensor signals in isolation: they are either limited to temporal feature extraction, failing to capture intricate spatiotemporal dependencies, or rely on fixed time-frequency transformations that cannot flexibly adapt to the dynamic frequency characteristics of the signal. These data-format constraints reduce generalization across varying sensor inputs and complex tasks. In this paper, we propose a novel hybrid network, LI-TFMNet, that explores the multi-modal representations inherent in the signals from different perspectives by integrating a cross-modal branch (LI-Net) and a cross-domain branch (TFM-Net). Specifically, the cross-modal branch (LI-Net) introduces four learnable temporal two-dimensionalization methods that model the relationship between temporal features and two-dimensional image-based representations. In parallel, the cross-domain branch (TFM-Net) employs a novel signal-to-image conversion approach, tailored to multi-channel time series, to fully exploit the interactions among multiple sequences. Additionally, we construct three multi-scale depth-wise convolution modules to extract features from each modality at multiple scales. An adaptive-weight fusion strategy based on a multi-head cross-attention mechanism (ARMHCA) is designed not only to refine within-modal representations but also to effectively capture and leverage inter-modal correlations. Extensive experiments on multiple gesture recognition datasets demonstrate that LI-TFMNet consistently outperforms existing methods across varying window lengths. Moreover, it achieves state-of-the-art (SOTA) performance on the UCI-HAR human activity recognition dataset, further demonstrating its strong generalization and robustness.
We anticipate that the various methodologies proposed in this study will provide new perspectives and directions for the field of time series classification.
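To make the fusion idea concrete, the sketch below shows one plausible reading of an adaptive-weight residual multi-head cross-attention fusion: features of one modality attend over the other, and a learnable scalar gate weights the cross-modal signal before the residual addition. This is a minimal NumPy illustration, not the paper's implementation; all function names, the sigmoid gating, and the single-scalar weight `alpha` are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_cross_attention(q_feat, kv_feat, Wq, Wk, Wv, Wo, num_heads):
    """Queries from one modality attend over keys/values of the other.

    q_feat: (T, d) features of the query modality
    kv_feat: (S, d) features of the other modality
    Wq, Wk, Wv, Wo: (d, d) projection matrices
    """
    T, d = q_feat.shape
    S = kv_feat.shape[0]
    dh = d // num_heads
    # project and split into heads: (num_heads, seq_len, dh)
    Q = (q_feat @ Wq).reshape(T, num_heads, dh).transpose(1, 0, 2)
    K = (kv_feat @ Wk).reshape(S, num_heads, dh).transpose(1, 0, 2)
    V = (kv_feat @ Wv).reshape(S, num_heads, dh).transpose(1, 0, 2)
    # scaled dot-product attention per head: (num_heads, T, S)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
    # merge heads back to (T, d) and apply the output projection
    out = (attn @ V).transpose(1, 0, 2).reshape(T, d)
    return out @ Wo

def adaptive_residual_fusion(x, cross, alpha):
    # hypothetical adaptive-weight residual: a learnable scalar alpha,
    # squashed by a sigmoid, gates how much cross-modal information is mixed in
    w = 1.0 / (1.0 + np.exp(-alpha))
    return x + w * cross
```

In a trained model, `alpha` (and the projection matrices) would be learned parameters, letting the network adaptively balance within-modal and cross-modal features per fusion site.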