Improving End-to-End Sign Language Translation via Multi-Level Contrastive Learning

Biao Fu, Liang Zhang, Peigen Ye, Pei Yu, Cong Hu, Xiaodong Shi, Yidong Chen

Published: 01 Jan 2025, Last Modified: 10 Nov 2025. IEEE Transactions on Audio, Speech and Language Processing. License: CC BY-SA 4.0
Abstract: Sign Language Translation (SLT) aims to translate the content of a sign language video into a spoken language sentence, a promising technology for bridging the communication gap between deaf and hearing people. End-to-end SLT models are increasingly becoming the dominant paradigm due to their inherent advantages in reducing error propagation and latency. Nonetheless, a significant limitation of this end-to-end paradigm is its heavy dependence on large-scale parallel data, which poses a challenge to improving SLT performance given the prohibitive costs of data collection and annotation. Our preliminary observations show that this data scarcity leads to the collapse of token (sub-word unit, in this paper) representations and to inaccurate generated tokens. To alleviate this issue, we propose MCL-SLT, a novel Multi-level Contrastive Learning method for SLT, which incorporates token- and sentence-level contrastive learning into SLT training to learn effective token representations. Specifically, token-level contrastive learning generates positive examples by data augmentation and selects diverse, hard negative examples from the vocabulary to learn more discriminative token representations. Additionally, sentence-level contrastive learning uses sign video representations as anchors to further refine the quality of token representations. Extensive experiments on three widely used datasets, PHOENIX-2014T, CSL-Daily, and How2Sign, demonstrate the effectiveness of MCL-SLT, with significant improvements over baselines, and also show the superior robustness and generalization of our method in signer-independent settings.
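To make the two levels of contrastive learning described above concrete, the sketch below shows a plausible InfoNCE-style formulation: a token-level loss whose anchor is pulled toward an augmented view and pushed away from hard negatives sampled from the vocabulary, and a sentence-level loss whose anchor is a pooled sign-video representation. This is a minimal illustration assuming standard contrastive-loss conventions; the function names, tensor shapes, temperature, and negative-sampling scheme are assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of token- and sentence-level
# InfoNCE-style contrastive losses, following the abstract's description.
import torch
import torch.nn.functional as F


def token_contrastive_loss(tok_repr, tok_repr_aug, vocab_emb, neg_ids, tau=0.1):
    """Token-level loss: each token representation (anchor) is pulled toward
    its augmented view (positive) and pushed away from hard negative token
    embeddings drawn from the vocabulary.

    tok_repr     : (N, d)  token representations
    tok_repr_aug : (N, d)  representations of the same tokens under augmentation
    vocab_emb    : (V, d)  vocabulary (sub-word) embedding table
    neg_ids      : (N, K)  indices of K hard negative tokens per anchor
    """
    anchor = F.normalize(tok_repr, dim=-1)
    pos = F.normalize(tok_repr_aug, dim=-1)
    neg = F.normalize(vocab_emb[neg_ids], dim=-1)            # (N, K, d)

    pos_sim = (anchor * pos).sum(-1, keepdim=True)           # (N, 1)
    neg_sim = torch.einsum("nd,nkd->nk", anchor, neg)        # (N, K)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / tau
    # The positive always sits at index 0 of the logits.
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)


def sentence_contrastive_loss(video_repr, sent_repr, tau=0.1):
    """Sentence-level loss: pooled sign-video representations act as anchors;
    the matching sentence representation is the positive, and other sentences
    in the batch serve as in-batch negatives.

    video_repr : (B, d)  pooled sign video representations
    sent_repr  : (B, d)  pooled spoken-sentence (token) representations
    """
    v = F.normalize(video_repr, dim=-1)
    s = F.normalize(sent_repr, dim=-1)
    logits = v @ s.t() / tau                                  # (B, B)
    target = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, target)
```

In such a setup, the two losses would typically be added to the standard translation (cross-entropy) objective with scalar weights; those weights and the pooling choices are likewise assumptions here.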