Keywords: multilingual adaptation, large language model.
TL;DR: We compile the MaLA corpus, a comprehensive multilingual dataset, enrich it with curated datasets across diverse domains, and use it to train EMMA-500, a large-scale multilingual language model.
Abstract: In this work, we introduce EMMA-500, a large-scale multilingual language model continually pre-trained on texts across 546 languages, designed for enhanced multilingual performance with a focus on improving coverage of low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset, and enrich it with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding the language capacity of large language models, particularly for underrepresented languages, with significant gains in cross-lingual transfer, task generalization, and language adaptability.
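The abstract describes continual pre-training of Llama 2 7B on the MaLA corpus. Below is a minimal illustrative sketch of such a setup using Hugging Face Transformers; it is not the authors' released training code, and the corpus path, file format, and hyperparameters are hypothetical placeholders rather than values reported in the paper.

```python
# Illustrative sketch of continual pre-training a causal LM on a multilingual
# corpus; paths and hyperparameters are placeholders, not the paper's values.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # base model named in the abstract
BLOCK_SIZE = 2048

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Hypothetical local copy of the multilingual corpus: JSONL files with a
# "text" field per document.
raw = load_dataset("json", data_files={"train": "mala_corpus/*.jsonl"})

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate documents and split into fixed-length blocks for causal LM training.
    concat = {k: sum(batch[k], []) for k in batch}
    total = (len(concat["input_ids"]) // BLOCK_SIZE) * BLOCK_SIZE
    return {
        k: [v[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
        for k, v in concat.items()
    }

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="emma-500-ckpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=100,
        save_steps=1000,
    ),
    train_dataset=lm_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```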
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9771