KSRB-Net: a continuous sign language recognition deep learning strategy based on motion perception mechanism

Published: 01 Jan 2024 · Last Modified: 04 Nov 2025 · Vis. Comput. 2024 · CC BY-SA 4.0
Abstract: Continuous sign language recognition (CSLR) is an intricate task aimed at transcribing sign language sequences from continuous video streams into sentences. Typically, deep learning-based CSLR systems comprise a visual encoder for feature extraction and a sequence learning model that maps the input sequence to sentence-level output labels. The complex nature of sign language, characterized by an extensive vocabulary and many similar gestures and motions, renders CSLR particularly challenging. Moreover, because signing glosses are unavailable for frame-level alignment, CSLR is only weakly supervised; the detailed labeling of each word in a sentence that it requires limits the amount of training data available. In this paper, we propose a CSLR framework named KSRB-Net to address these critical problems. The proposed method incorporates a practical module that efficiently captures frame-wise motion information and spatio-temporal context, and that can be embedded into existing feature extraction modules. Additionally, a keyframe extraction algorithm based on the characteristics of the sign language datasets is designed to significantly accelerate model training and reduce the risk of overfitting. Finally, connectionist temporal classification (CTC) is employed as the objective function to capture the alignment proposal. The proposed method is validated on three datasets, namely the Chinese TJUT-SLRT, the Chinese USTC-CSL, and the German RWTH-Phoenix-Weather-2014. Experimental results demonstrate that KSRB-Net achieves 98.40% accuracy and outperforms state-of-the-art methods in terms of efficiency and accuracy.
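The abstract notes that connectionist temporal classification (CTC) supplies the training objective when gloss-level alignment is unavailable. As background (not the paper's implementation), the CTC likelihood of a label sequence can be computed with the standard forward (alpha) recursion over an extended target in which blanks are interleaved between labels; the sketch below is a minimal pure-Python version, with the function name and interface chosen for illustration:

```python
import math

def ctc_log_likelihood(log_probs, target, blank=0):
    """Log-probability of `target` under CTC via the forward (alpha) recursion.

    log_probs: list of T lists, each of length V, holding per-frame
               log-probabilities over the vocabulary (blank at index `blank`).
    target:    list of label indices (no blanks).
    """
    # Extended target: blanks interleaved between labels and at both ends.
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[s]: log-probability of all alignments of the first t+1 frames
    # that end at extended-target position s.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                      # stay on the same position
            if s > 0:
                a = logsumexp(a, alpha[s - 1])  # advance by one
            # Skip the blank between two different labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logsumexp(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or the trailing blank.
    return logsumexp(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

For instance, with two frames, a two-symbol vocabulary, uniform per-frame probabilities of 0.5, and target `[1]`, the valid alignments are (1,1), (blank,1), and (1,blank), so the likelihood is 0.75; the negative of this log-likelihood is what a CTC-trained model minimizes.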