Bhaasha, Bhāṣā, Zaban: A Survey for Low-Resourced Languages in South Asia – Current Stage and Challenges

ACL ARR 2025 February Submission 5200 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Rapid developments in pre-trained or large language models have revolutionized many NLP tasks on English datasets in recent years. Unfortunately, model development and evaluation for low-resource languages are being overlooked, especially for languages in South Asia. While there are over 650 languages in South Asia, many of them either have very limited computational resources or are not supported by existing language models. Thus, the concrete question this study addresses is: "Can we assess the current stage and challenges to inform our NLP community and facilitate model development for South Asian languages?" In this survey, we comprehensively examine current efforts and challenges in NLP model development for low-resourced South Asian languages by retrieving studies published since 2020, with a focus on transformer-based language models such as BERT, T5, and GPT. Our study presents insights and issues from three essential aspects: data, models, and tasks, covering topics such as available data sources, fine-tuning strategies, and domain applications. Our findings highlight substantial challenges, such as missing data in critical domains (e.g., health), code-mixing, and a lack of standardized evaluation procedures. We hope that our survey can draw the community's attention to more targeted data curation, unified benchmarks tailored to the cultural and linguistic nuances of South Asia, and stronger collaborative efforts to ensure equitable representation for all languages in South Asia.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Survey, Multilingual, Low-resource, South Asia, multilingualism, mixed language
Contribution Types: Approaches to low-resource settings, Data resources, Data analysis, Surveys
Languages Studied: Hindi, Bengali, Urdu, Pashto, Balochi, Persian, Telugu, Tamil, Kannada, Manipuri, Mizo
Submission Number: 5200