BYOL: Bring Your Own Language Into LLMs

04 May 2026 (modified: 13 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and diminished accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework that enables scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) based on curated web-scale corpora, and uses this classification to determine the appropriate integration strategy. For low-resource languages, we propose a full-stack data refinement and expansion pipeline, combining corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Māori, this pipeline yields two language-specific LLMs that achieve ~12% average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, demonstrating with Inuktitut that a tailored MT system can deliver a +4 BLEU improvement over a commercial baseline, enabling high-accuracy LLM access in settings where direct modeling is otherwise infeasible. Our results show that BYOL offers a practical, extensible, and data-efficient recipe for expanding LLM capabilities to the long tail of the world's languages. We will release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Māori, and Inuktitut, and make our codebase and models publicly available.
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Vimal_Thilak2
Submission Number: 8757