Keywords: Sign Language Translation; Vision–Language Models; Multimodal Learning; Parameter-Efficient Fine-Tuning; Low-Resource Languages; Inclusive AI
Abstract: Vision-Language Models (VLMs) have shown strong generalization across multimodal tasks, but their capacity to handle \textit{sign language translation (SLT)}---which requires fine-grained spatiotemporal reasoning and linguistic understanding---remains unclear. In this study, we evaluate whether \textit{small-scale VLMs} ($\leq$3B parameters) can perform SLT effectively.
We conduct supervised fine-tuning on multilingual sign language datasets (DGS, ASL, and ISL), applying parameter-efficient LoRA to the language decoder while keeping the vision encoder frozen and the connector trainable (see the configuration sketch below). To evaluate translation quality, we propose entity- and semantics-aware metrics tailored to SLT. We also highlight data imbalance issues in these widely used SLT datasets.
Our analysis highlights the limitations of applying general-purpose VLMs to SLT, in contrast to their success on other multimodal tasks, and provides insights to inform the future development of VLMs for sign language processing (SLP), which is essential for building inclusive AI applications.
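For concreteness, the sketch below shows one way the described setup could be realized with Hugging Face transformers and peft: LoRA adapters on the language decoder, vision encoder frozen, connector trainable. The checkpoint, LoRA hyperparameters, and module names (q_proj, connector, etc.) are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch, assuming a Hugging Face VLM checkpoint and standard
# Llama-style decoder module names; not the paper's exact configuration.
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"  # assumed example of a <=3B VLM

model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,            # assumed rank; the abstract does not state hyperparameters
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed decoder attention projection names; in practice these patterns
    # may also match the vision tower and should be restricted to the decoder.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# get_peft_model freezes every base parameter (including the vision encoder),
# so only the connector / multimodal projector needs to be re-enabled.
for name, param in model.named_parameters():
    if "connector" in name or "projector" in name:  # assumed naming convention
        param.requires_grad = True

model.print_trainable_parameters()
```

Freezing the vision encoder and training only the adapters and the connector keeps the trainable parameter count small, which matches the low-compute setting the submission targets.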
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: NLP tools for social analysis, sociolinguistics, cross-modal machine translation, video processing, evaluation and metrics, parameter-efficient training, NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low-compute settings-efficiency, Data analysis
Languages Studied: German Sign Language, American Sign Language, Indian Sign Language
Submission Number: 9227