Improving Sign Language Understanding with a Multi-Stream Masked Autoencoder Trained on ASL Videos

ACL ARR 2025 May Submission 2367 Authors

19 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, License: CC BY 4.0
Abstract: Understanding sign language remains a significant challenge, particularly for low-resource sign languages with limited annotated data. Motivated by the success of large-scale pretraining in deep learning, we propose Multi-Stream Masked Autoencoder (MS-MAE) — a simple yet effective framework for learning sign language representations from skeleton-based video data. Our approach begins with pretraining MS-MAE on the large-scale YouTube-ASL dataset, using a masked reconstruction objective to model sign sequences. The pretrained model is then adapted to multiple downstream tasks across different sign languages. Experimental results show that, after fine-tuning, MS-MAE achieves competitive or superior performance on a range of isolated sign language recognition benchmarks, including WLASL, ASL Citizen, Slovo, and the JSL Corpus. Furthermore, it demonstrates strong performance on sign language translation tasks, achieving results comparable to state-of-the-art methods on PHOENIX14T, CSL-Daily, and How2Sign. These findings highlight the potential of leveraging large-scale, high-resource sign language data to boost performance in low-resource sign language scenarios.
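As a rough illustration of the masked-reconstruction objective described in the abstract, the following Python (PyTorch) sketch pretrains a single-stream masked autoencoder on flattened skeleton keypoint sequences. This is a hypothetical simplification, not the authors' MS-MAE: the class name SkeletonMAE, the 75-joint / 2-coordinate layout, the 50% masking ratio, and all layer sizes are assumptions, and the multi-stream design and positional encodings of the actual model are omitted.

import torch
import torch.nn as nn

class SkeletonMAE(nn.Module):
    # Hypothetical single-stream masked autoencoder over skeleton keypoints.
    # All hyperparameters below are illustrative assumptions.
    def __init__(self, num_joints=75, coord_dim=2, d_model=256, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(num_joints * coord_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.Linear(d_model, num_joints * coord_dim)

    def forward(self, frames):
        # frames: (batch, time, num_joints * coord_dim) flattened 2D keypoints
        tokens = self.embed(frames)
        # Randomly mask a subset of frame tokens and replace them with a learned mask token.
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        # Encode the partially masked sequence and reconstruct keypoints for every frame.
        recon = self.decoder(self.encoder(tokens))
        # Reconstruction loss is computed only on the masked positions.
        return ((recon - frames) ** 2)[mask].mean()

model = SkeletonMAE()
clips = torch.randn(2, 64, 150)  # 2 clips, 64 frames, 75 joints x 2 coordinates
loss = model(clips)
loss.backward()

In the paper's setting, a pretrained encoder of this kind is then adapted, via fine-tuning with a recognition or translation head, to the downstream benchmarks listed in the abstract.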
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, cross-modal machine translation, self-supervised learning
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Sign language
Submission Number: 2367