NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

ACL ARR 2025 May Submission2564 Authors

19 May 2025 (modified: 03 Jul 2025) | ACL ARR 2025 May Submission | CC BY 4.0
Abstract: Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora. While this approach yields promising linguistic understanding and translation abilities, it often results in models aligned with the source-language culture, and these models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof of concept, we developed NileChat, a 3B-parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We are sharing our methodology, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: data augmentation, NLP in resource-constrained settings, values and culture, datasets for low resource languages
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: Moroccan Arabic dialect, Egyptian Arabic dialect
Submission Number: 2564