Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation

ACL ARR 2026 January Submission 6507 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: sign language, machine translation, sign language translation, paraphrasing, human evaluation, large language models, metrics, BLEU
Abstract: Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference, despite the highly non-isomorphic relationship between sign and spoken languages, where multiple translations can be equally valid. This limitation constrains both model training and evaluation, particularly for n-gram–based metrics such as BLEU. In this work, we investigate the use of Large Language Models to automatically generate paraphrased variants of written-language translations as synthetic alternative references for SLT. First, we compare multiple paraphrasing strategies and models using an adapted ParaScore metric. Second, we study the impact of paraphrases on both training and evaluation of the pose-based T5 model on the YouTubeASL and How2Sign datasets. Our results show that naively incorporating paraphrases during training does not improve translation performance and can even be detrimental. In contrast, using paraphrases during evaluation leads to higher automatic scores and better alignment with human judgments. To formalize this observation, we introduce BLEU_para, an extension of BLEU that evaluates translations against multiple paraphrased references. Human evaluation confirms that BLEU_para correlates more strongly with perceived translation quality. We release all generated paraphrases, together with the generation and evaluation code, to support reproducible and more reliable evaluation of SLT systems.
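The abstract's exact formulation of BLEU_para is given in the paper itself; as a minimal sketch of the underlying mechanism, standard BLEU already supports multiple references by clipping each hypothesis n-gram count against the maximum count observed in any reference, so any one paraphrase can license a match. The function name `bleu_para` and the smoothing-free sentence-level form below are illustrative assumptions, not the authors' implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_para(hypothesis, references, max_n=4):
    """Sentence-level multi-reference BLEU (illustrative sketch).

    For each n-gram order, hypothesis counts are clipped against the
    per-n-gram maximum over all references, so a paraphrased reference
    can supply matches that the single original reference would miss.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        if not hyp_ngrams:
            return 0.0
        # Max count of each n-gram across all references (the clipping bound).
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_prec += math.log(clipped / sum(hyp_ngrams.values())) / max_n
    # Brevity penalty against the reference length closest to the hypothesis.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(log_prec)

# A hypothesis that misses the original reference but matches a paraphrase:
hyp = "the cat sat on the mat"
single = bleu_para(hyp, ["a cat was sitting on the mat"])
multi = bleu_para(hyp, ["a cat was sitting on the mat", "the cat sat on the mat"])
```

Here `single` is 0 (no 4-gram overlap with the lone reference), while `multi` is 1.0 once the paraphrase is added, which is the effect the paper's evaluation exploits. Production evaluation would use a tokenized, smoothed implementation such as sacrebleu's multi-reference mode rather than this sketch.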
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: automatic evaluation, human evaluation, multimodality, cross-modal machine translation, evaluation methodologies, metrics, evaluation, paraphrasing, data augmentation, generative models
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Theory
Languages Studied: American Sign Language, English
Submission Number: 6507