Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

ACL ARR 2025 July Submission 396 Authors

27 Jul 2025 (modified: 29 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite advances in sign-language recognition, translation, and production, progress remains limited by fragmented datasets, inconsistent annotations, and narrow linguistic coverage. Existing benchmarks often fail to support real-world communication needs, and systematic analyses of their limitations are rare. In this survey, we present the most comprehensive index of sign-language datasets to date, covering 119 resources across 35 signed languages, and identify key challenges, including modality imbalance, annotation granularity, and signer bias. We propose essential requirements for future datasets and introduce a 24-field Sign-Language Datasheet template, along with a public GitHub repository for dataset documentation. Our work provides a unified foundation for developing inclusive and robust sign-language technologies.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Sign Language, Multimodal Dataset, Survey, Annotation, Benchmark
Contribution Types: Model analysis & interpretability, Reproduction study, Publicly available software and/or pre-trained models, Data resources, Data analysis, Surveys
Languages Studied: American Sign Language (ASL), British Sign Language (BSL), German Sign Language (DGS), Chinese Sign Language (CSL), French Sign Language (LSF), Spanish Sign Language (LSE), Italian Sign Language (LIS), Russian Sign Language (RSL), Japanese Sign Language (JSL), Korean Sign Language (KSL), Turkish Sign Language (TID), Brazilian Sign Language (Libras), Australian Sign Language (Auslan), Indian Sign Language (ISL), and others
Submission Number: 396