South Asian Voices in LLMs: Culturally Aware Multilingual Instruction Fine-Tuning for South Asian Low-Resource Languages
Abstract: Can large language models (LLMs) truly understand and represent the rich cultural and linguistic diversity of a region? To address this question, we develop a culturally adaptive multilingual instruction dataset and fine-tune an LLM to improve its cultural alignment, multilingual fluency, and instruction-following capabilities across 15 South Asian low-resource languages. We systematically construct the South Asian Instruction Dataset (SAID) by combining automated LLM-based semantic categorization, human-in-the-loop cultural tagging, and country-specific localization using state-of-the-art multilingual LLMs. The dataset spans eight SAARC countries and covers ten culturally relevant domains. We apply parameter-efficient LoRA fine-tuning to the LLaMA 3.1 Instruct model and conduct a comprehensive evaluation that combines automated LLM judgment with large-scale human expert assessment. The resulting fine-tuned model, which we call SAID-LLaMA 3.1 Instruct, demonstrates substantial improvements over the base LLaMA 3.1 Instruct model in generating culturally aligned, factually accurate, and linguistically fluent responses for high- and mid-resource South Asian languages. Theoretically, this work advances the understanding of how cultural adaptation and multilingual fine-tuning can enhance LLM performance in low-resource contexts. Practically, it provides a high-quality, culturally grounded instruction dataset and a fine-tuning methodology that can guide the development of more inclusive AI systems for South Asia.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, multilingual representations, resources for less-resourced languages, indigenous languages, minoritized languages
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Sinhala, Nepali, Maithili, Punjabi, Assamese, Sanskrit, Urdu, Bengali, Dhivehi, Pashto, Dari, Awadhi, Marathi, Telugu, Dzongkha
Submission Number: 7429