Abstract: Sign Language Translation (SLT) aims to convert sign language videos into corresponding spoken text sequences. However, the inherent modality gap between sign language video and text hinders the development of SLT. Motivated by the linguistic consistency between gloss and text, we propose EMF-SLT, an Explicit Multi-modal Fusion method for Sign Language Translation to mitigate the modality gap with the help of gloss. Specifically, EMF-SLT first leverages a vector quantizer and a fusion module to align and fuse sign language and gloss features, respectively, resulting in more informative multi-modal features for the decoder. Then, a multi-task mutual learning framework is introduced to regularize the output predictions from different modalities, which ensures the consistency of outputs across modalities and encourages different modalities to learn from each other. Experiments on two SLT benchmarks and further analyses show that our method achieves significant improvements over the baselines and effectively alleviates the modality gap.
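As a rough illustration of the mutual-learning idea described above, the sketch below shows one common way to regularize the output distributions of two modality branches with a symmetric KL term. This is only a minimal PyTorch-style sketch, not the paper's actual formulation; the names `logits_fused`, `logits_gloss`, and the temperature `tau` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def mutual_learning_loss(logits_fused: torch.Tensor,
                         logits_gloss: torch.Tensor,
                         tau: float = 1.0) -> torch.Tensor:
    """Symmetric KL between the decoder distributions of two modality branches.

    Hypothetical sketch: the abstract only states that predictions from
    different modalities are regularized to stay consistent and to learn
    from each other; the exact loss used in EMF-SLT may differ.
    """
    log_p = F.log_softmax(logits_fused / tau, dim=-1)   # fused-branch log-probs
    log_q = F.log_softmax(logits_gloss / tau, dim=-1)   # gloss-branch log-probs

    # F.kl_div(input, target) computes KL(target || input_dist),
    # so the two calls below give KL(p || q) and KL(q || p).
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)
```

In a multi-task setup of this kind, such a consistency term is typically added to the usual translation (cross-entropy) losses of each branch with a small weighting coefficient.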