SignFormer-GCN: Continuous Sign Language Translation using Spatio-Temporal Graph Convolutional Networks

Published: 22 Sept 2025, Last Modified: 27 Nov 2025, Venue: WiML @ NeurIPS 2025, License: CC BY 4.0
Keywords: Sign Language Translation (SLT), Multimodal Learning, Spatio-Temporal Graph Convolutional Networks (STGCN), Transformer, Sequence-to-Sequence Models, Low-Resource Language
Abstract: Sign language is a complex visual language that uses hand gestures, facial expressions, and body movements to convey meaning. It is the primary means of communication for millions of deaf and hard-of-hearing individuals worldwide. Tracking physical actions, such as hand movements and arm orientation, alongside expressive actions, including facial expressions, mouth movements, eye movements, eyebrow gestures, head movements, and body postures, can be limiting when only RGB features are used, because backgrounds and signers differ across datasets. Despite this limitation, most Sign Language Translation (SLT) research relies solely on RGB features. We use keypoint features in addition to RGB features to better capture the pose and configuration of the body parts involved in signing. Similarly, most SLT work uses transformers, which are good at capturing broader, high-level context and focusing on the most relevant video frames, but which neglect the inherent graph structure of sign language and so fail to capture low-level details. To address this, we jointly encode the input with a transformer and a Spatio-Temporal Graph Convolutional Network (STGCN), capturing both the context of sign language expressions and the spatial and temporal dependencies on skeleton graphs. Our method, SignFormer-GCN, achieves competitive performance on the RWTH-PHOENIX-2014T (German Sign Language), How2Sign (American Sign Language), and BornilDB v1.0 (Bangla Sign Language) datasets, demonstrating its effectiveness across different sign languages. On RWTH-PHOENIX-2014T, our model achieves a BLEU-4 score of 19.75, approaching the gloss-free state of the art, GFSLT-VLP, which reports 21.44, while requiring roughly 12 times fewer parameters, highlighting its efficiency–accuracy advantage. On How2Sign, SignFormer-GCN sets a new state of the art with an rBLEU of 2.96 and a BLEU-4 of 8.53, surpassing the previous best gloss-free model, asl video2text (rBLEU 2.56, BLEU-4 7.95). On BornilDB v1.0, we provide the first gloss-free benchmark, achieving a BLEU-4 of 0.58 and extending continuous SLT research to low-resource sign languages.
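To make the phrase "spatial and temporal dependencies on skeleton graphs" concrete, the following is a minimal pure-Python sketch of one generic ST-GCN-style step: a spatial mix over a normalized skeleton adjacency, followed by a simple temporal smoothing. It is not the authors' implementation. The 5-joint skeleton, the edge list, the keypoint values, and all function names are hypothetical, and the moving average is only a stand-in for a learned temporal convolution.

```python
# Illustrative sketch only: a generic ST-GCN-style step, not the SignFormer-GCN code.
# The skeleton, edges, and keypoint values below are hypothetical.

def normalized_adjacency(num_joints, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]  # always >= 1 because of the self-loops
    return [[a[i][j] / ((deg[i] ** 0.5) * (deg[j] ** 0.5))
             for j in range(num_joints)] for i in range(num_joints)]

def spatial_graph_conv(frame, adj):
    """Mix each joint's features with its neighbours': out = adj @ frame."""
    n, c = len(frame), len(frame[0])
    return [[sum(adj[i][k] * frame[k][ch] for k in range(n)) for ch in range(c)]
            for i in range(n)]

def temporal_average(frames, window=3):
    """Centered moving average over time, a stand-in for a temporal convolution."""
    t, n, c = len(frames), len(frames[0]), len(frames[0][0])
    half = window // 2
    out = []
    for ti in range(t):
        lo, hi = max(0, ti - half), min(t, ti + half + 1)
        out.append([[sum(frames[s][j][ch] for s in range(lo, hi)) / (hi - lo)
                     for ch in range(c)] for j in range(n)])
    return out

# Toy arm: shoulder(0) - elbow(1) - wrist(2), with thumb(3) and index(4) on the wrist.
EDGES = [(0, 1), (1, 2), (2, 3), (2, 4)]
adj = normalized_adjacency(5, EDGES)

# Three frames of (x, y) keypoints per joint; the values are made up.
frames = [
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.5, 0.5], [2.5, -0.5]],
    [[0.0, 0.0], [1.0, 0.2], [2.0, 0.4], [2.5, 0.9], [2.5, -0.1]],
    [[0.0, 0.0], [1.0, 0.4], [2.0, 0.8], [2.5, 1.3], [2.5, 0.3]],
]

mixed = [spatial_graph_conv(f, adj) for f in frames]  # spatial step per frame
smoothed = temporal_average(mixed)                    # temporal step per joint
```

In a trained model the spatial step would also multiply by a learned weight matrix, and the temporal step would use learned kernels; the sketch keeps only the graph structure that distinguishes this encoding from a frame-level transformer.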
Submission Number: 10