Abstract: As a bridge between deaf people and the outside world, sign language primarily involves hand movements, complemented by intricate facial and body expressions. To enhance the engagement of sign language users in today's prevalent online social activities, it is necessary to incorporate video sign language recognition (SLR) technology into social multimedia. However, current multimodal video SLR methods predominantly rely on limited data, leading to poor robustness and susceptibility to overfitting. In addition, most prior works employ simple concatenation and fail to explore effective interaction among different modalities. To address these issues, we propose a cross-attention modal fusion network (CAMFuse) based on red-green-blue (RGB) and skeleton modalities to achieve more robust video SLR. First, CAMFuse introduces a comprehensive bimodal framework that considers coarse-grained features from the body and fine-grained features from the hands and face. Second, departing from previous skeleton-based video SLR methods that represent the skeleton as graphs, CAMFuse adopts heatmap volumes to reduce storage and facilitate subsequent interaction. Last, a space-time cross-attention fusion (ST-CAF) module is applied at the deeper feature-extraction stages of the RGB and skeleton streams to mine complementary relationships between them and let each modality learn informative cues from the other, reducing the independence among modalities. Experimental results demonstrate the effectiveness of the proposed CAMFuse, which outperforms state-of-the-art methods on the popular isolated sign language datasets WLASL-2000 (53.12%) and AUTSL (96.18%).
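The abstract does not give implementation details of the ST-CAF module, but the general idea of bidirectional space-time cross-attention between an RGB stream and a skeleton-heatmap stream can be illustrated with a minimal sketch. All module names, tensor shapes, and hyperparameters below are hypothetical assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch of bidirectional space-time cross-attention fusion between an
# RGB feature map and a skeleton-heatmap feature map, assuming both streams
# output tensors of shape (B, C, T, H, W). Names and shapes are illustrative.
import torch
import torch.nn as nn


class SpaceTimeCrossAttentionFusion(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # RGB tokens query the skeleton tokens, and vice versa.
        self.rgb_to_skel = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.skel_to_rgb = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(channels)
        self.norm_skel = nn.LayerNorm(channels)

    @staticmethod
    def _to_tokens(x: torch.Tensor) -> torch.Tensor:
        # (B, C, T, H, W) -> (B, T*H*W, C): one token per space-time location.
        return x.flatten(2).transpose(1, 2)

    def forward(self, rgb_feat: torch.Tensor, skel_feat: torch.Tensor):
        b, c, t, h, w = rgb_feat.shape
        rgb_tok = self._to_tokens(rgb_feat)
        skel_tok = self._to_tokens(skel_feat)

        # Cross-attention: each stream attends to the other's tokens and adds
        # the attended context back through a residual connection.
        rgb_ctx, _ = self.rgb_to_skel(rgb_tok, skel_tok, skel_tok)
        skel_ctx, _ = self.skel_to_rgb(skel_tok, rgb_tok, rgb_tok)
        rgb_fused = self.norm_rgb(rgb_tok + rgb_ctx)
        skel_fused = self.norm_skel(skel_tok + skel_ctx)

        # Restore the (B, C, T, H, W) layout for the downstream backbone stages.
        rgb_fused = rgb_fused.transpose(1, 2).reshape(b, c, t, h, w)
        skel_fused = skel_fused.transpose(1, 2).reshape(b, c, t, h, w)
        return rgb_fused, skel_fused


# Example: fuse intermediate features from the two streams.
rgb = torch.randn(2, 64, 8, 7, 7)    # hypothetical RGB-stream features
skel = torch.randn(2, 64, 8, 7, 7)   # hypothetical heatmap-stream features
fused_rgb, fused_skel = SpaceTimeCrossAttentionFusion(channels=64)(rgb, skel)
```

The bidirectional design reflects the stated goal of the abstract: rather than concatenating modality features once at the end, each stream exchanges information with the other at a deeper stage, so the fused representations are less independent.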