Abstract: Sign Language Translation (SLT) has advanced with deep learning, yet evaluations remain signer-dependent, with overlapping signers across training, development, and test sets. This raises concerns about whether models truly generalise or instead rely on signer-specific features. To address this, signer-fold cross-validation is conducted on GFSLT-VLP, GASLT, and SignCL, three leading, publicly available gloss-free SLT models. Experiments are performed on two benchmark datasets, CSL-Daily and PHOENIX14T. The results reveal a significant performance drop under signer-independent settings. On PHOENIX14T, GFSLT-VLP sees BLEU-4 fall from 21.44 to as low as 3.59 and ROUGE-L from 42.49 to 11.89; GASLT drops from a reported 15.74 to 8.26; and SignCL from 22.74 to 3.66. These findings highlight the substantial overestimation of SLT model performance when evaluations are conducted under signer-dependent assumptions. This work proposes two key recommendations: (1) adopting signer-independent evaluation protocols, and (2) restructuring datasets to include signer-independent splits.
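The signer-fold protocol described in the abstract amounts to group-wise cross-validation with signer identity as the grouping key. Below is a minimal sketch of that idea, assuming per-sample signer IDs are available in the dataset metadata; the use of scikit-learn's GroupKFold and the names samples, labels, and signer_ids are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of signer-independent (signer-fold) cross-validation.
# Assumes each sample carries a signer ID in the dataset metadata; the
# arrays below are hypothetical stand-ins, not taken from the paper.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 16))        # stand-in for video features
labels = rng.integers(0, 5, size=100)       # stand-in for target sentences
signer_ids = rng.integers(0, 8, size=100)   # one signer ID per sample

# GroupKFold guarantees that no signer appears in both the training and
# test folds, which is the signer-independent condition argued for above.
gkf = GroupKFold(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(
    gkf.split(samples, labels, groups=signer_ids)
):
    train_signers = set(signer_ids[train_idx])
    test_signers = set(signer_ids[test_idx])
    assert train_signers.isdisjoint(test_signers)  # no signer overlap
    print(f"fold {fold}: {len(test_signers)} held-out signers")
```

Reporting the mean and spread of BLEU-4 / ROUGE-L across such folds, rather than a single signer-overlapping split, is one way to operationalise recommendation (1).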
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation, model bias/unfairness mitigation
Contribution Types: Model analysis & interpretability, Reproduction study, Data analysis
Languages Studied: German Sign Language, Chinese Sign Language
Keywords: sign language translation, deep learning, machine translation
Submission Number: 4229