Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
Abstract: Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present a systematic approach for developing bilingual LLMs for English and Hindi that prioritizes computational efficiency. Using a curated dataset of 485K English and Hindi instruction samples, we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance on both English and Hindi. Our experiments, encompassing seven LLMs of varying parameter sizes and over 140 training runs with varying English-Hindi data ratios, demonstrate that multilingual performance can be significantly improved without compromising native performance. Further, our approach avoids resource-intensive techniques such as vocabulary expansion or architectural modifications, keeping the model size unchanged. Our models achieve up to 3.51% higher average scores on Hindi tasks while maintaining or improving English performance. These results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research on under-represented and low-resource languages.
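To make the data-mixing step of the abstract concrete, below is a minimal, hedged sketch (not the authors' released code) of how English and Hindi instruction data could be combined at a chosen ratio before standard supervised fine-tuning. The file names, the 70:30 ratio, and the print check are illustrative assumptions; the paper sweeps many such ratios across its training runs.

```python
# Illustrative sketch, not the paper's released pipeline.
# Assumes two JSONL files of instruction samples exist locally (placeholder names).
from datasets import load_dataset, interleave_datasets

english = load_dataset("json", data_files="english_instructions.jsonl", split="train")
hindi = load_dataset("json", data_files="hindi_instructions.jsonl", split="train")

# Sample from the two sources with a desired English:Hindi proportion.
# 0.7/0.3 is a hypothetical ratio, one point in the kind of sweep the paper describes.
mixed = interleave_datasets(
    [english, hindi],
    probabilities=[0.7, 0.3],
    seed=42,
    stopping_strategy="all_exhausted",
)

# The mixed dataset can then be fed to any standard instruction-tuning loop
# (e.g., a causal-LM fine-tune of Qwen-2.5-14B-Instruct or Phi-4),
# with no vocabulary expansion or architectural changes.
print(len(mixed), mixed[0])
```

The design point this illustrates is that only the data composition changes between runs; the base model, tokenizer, and architecture stay fixed, which is what keeps the approach computationally light.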
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Post-Training, Multilingual, LLMs
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings / efficiency, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, Hindi
Submission Number: 7086