Isolated Sign Language Recognition Based on Heterogeneous Attention and Wavelet Temporal Graph Convolutional Networks

Published: 2025 · Last Modified: 28 Dec 2025 · CSCWD 2025 · CC BY-SA 4.0
Abstract: This paper proposes an isolated sign language recognition method based on a heterogeneous attention mechanism and Wavelet Temporal Graph Convolutional Networks (WT-GCN) to improve recognition accuracy in human-computer interaction scenarios. Although the winning entry of the CVPR 2021 ISLR challenge used a multi-stream fusion method that integrates results from four streams (Joint Stream, Bone Stream, Joint Motion Stream, Bone Motion Stream), applying the same network to every stream limited the effective use of stream-specific feature information and failed to fully integrate short- and long-term temporal information. To address this, we introduce WT-GCN, which combines wavelet convolution with temporal convolutional networks (TCN) and graph convolutional networks (GCN) to enhance the model's ability to capture multi-scale temporal features and to model short- and long-term dependencies in sign motions more accurately. Additionally, we introduce a heterogeneous attention mechanism composed of Spatial, Temporal, and Channel-wise attention (STC) and a Median-Optimized Compound Spatial and Channel Attention Block (MOCCS), which uses median pooling to strengthen the network's ability to capture skeletal motion features in the Bone Motion (BM) Stream. Experimental results on the AUTSL dataset show that the proposed method achieves an overall multi-stream accuracy of 96.34%, with the Bone Motion Stream reaching 93.89%, a 1.4% improvement over existing methods.
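The multi-stream design mentioned above combines the predictions of the Joint, Bone, Joint Motion, and Bone Motion streams. As a minimal sketch of such score-level late fusion, assuming a simple weighted average of per-stream class scores (the actual fusion weights of the challenge-winning method and of this paper may differ):

```python
import numpy as np

def fuse_streams(scores, weights=None):
    """Late score-level fusion of per-stream class scores.
    scores: list of (num_classes,) arrays, one per stream.
    weights: optional per-stream weights; defaults to uniform averaging."""
    scores = np.stack(scores)                      # (num_streams, num_classes)
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    fused = weights @ scores                       # weighted sum over streams
    return int(np.argmax(fused)), fused

# Toy usage with 3 classes and the four streams named in the abstract
joint       = np.array([0.1, 0.7, 0.2])
bone        = np.array([0.2, 0.5, 0.3])
joint_mot   = np.array([0.1, 0.6, 0.3])
bone_mot    = np.array([0.3, 0.4, 0.3])
pred, fused = fuse_streams([joint, bone, joint_mot, bone_mot])
print(pred)  # 1 -- the class with the highest fused score
```

Per-stream weighting lets a fusion scheme emphasize the stronger streams (e.g. the improved Bone Motion Stream) instead of treating all four identically.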
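The abstract attributes MOCCS's gain on the Bone Motion Stream to median pooling. As a hedged illustration of that idea, the sketch below applies a squeeze-and-excitation-style channel attention in which the usual average pooling is replaced by median pooling over the temporal and joint axes; the function name, weight matrix, and tensor layout are assumptions for illustration, not the paper's exact block:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def median_channel_attention(x, w):
    """Channel attention with median pooling (illustrative sketch).
    x: feature map of shape (C, T, V) -- channels, frames, skeleton joints.
    w: (C, C) matrix standing in for the excitation layer's weights."""
    desc = np.median(x, axis=(1, 2))   # median-pool each channel -> (C,)
    gate = sigmoid(w @ desc)           # per-channel gates in (0, 1)
    return x * gate[:, None, None]     # reweight channels of the feature map

# Toy usage: 4 channels, 8 frames, 5 joints
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 5))
y = median_channel_attention(x, np.eye(4))
print(y.shape)  # (4, 8, 5)
```

Compared with average pooling, the median is less sensitive to outlier frames, which is one plausible reason a median-based descriptor suits the noisier motion-difference features of the Bone Motion Stream.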