Abstract: Sign language recognition (SLR) refers to automatically interpreting sign language glosses from given videos. This research area presents a complex challenge in computer vision because of the rapid and intricate movements inherent in sign languages, which encompass hand gestures, body postures, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its independence from variations in subjects and backgrounds. However, current skeleton-based SLR methods exhibit three limitations: 1) they often neglect the importance of realistic hand poses, with most studies training SLR models on unrealistic skeletal representations; 2) they tend to assume complete data availability in both the training and inference phases, and capture intricate relationships among different body parts collectively; 3) they treat all sign glosses uniformly, failing to account for differences in the complexity of their skeletal representations. To enhance the realism of hand skeletal representations, we present a kinematic hand pose rectification method for enforcing constraints. To mitigate the impact of missing data, we propose a feature-isolated mechanism that focuses on capturing local spatial-temporal context; it captures this context concurrently and independently for individual features, thus enhancing the robustness of the SLR model. Additionally, to adapt to the varying complexity of sign glosses, we develop an input-adaptive inference approach that optimises computational efficiency and accuracy. Experimental results demonstrate the effectiveness of our approach, which achieves new state-of-the-art (SOTA) performance on WLASL100 and LSA64. On WLASL100, we achieve a top-1 accuracy of 86.50\%, a relative improvement of 2.39\% over the previous SOTA; on LSA64, we achieve a top-1 accuracy of 99.84\%. The artefacts and code related to this study are publicly available online (https://github.com/mpuu00001/Siformer.git).
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language, [Experience] Multimedia Applications, [Experience] Interactions and Quality of Experience
Relevance To Conference: In this study, we introduce a novel method, named Siformer, for skeleton-based sign language recognition (SLR). Siformer interprets sign language (SL) glosses from videos, thereby integrating vision and language, which aligns with the broad goal of understanding multimedia content across diverse applications, especially in the context of SL communication. The proposed approach contributes to a more comprehensive understanding of skeletal data processing and achieves new state-of-the-art performance on both benchmark datasets. To enhance the realism of hand skeletal representations, we present a hand pose rectification method based on the kinematic constraints of hand joints. To bolster robustness to missing data, we propose a feature-isolated mechanism, which captures locally focused feature maps from individual features. Additionally, our input-adaptive inference approach dynamically adjusts computational paths, optimising performance across sign glosses of varying complexity. The lightweight design of Siformer facilitates practical deployment on handheld devices, opening up possibilities for applications such as online SL learning, daily communication, and SL typing methods. We hope our research makes practical contributions to the field of SLR, fostering greater inclusivity and better communication for individuals who use SL.
Supplementary Material: zip
Submission Number: 4860