Bilingual Adaptation of Monolingual Foundation Models

Gurpreet Gosal; Yishi Xu; Gokulakrishnan Ramakrishnan; Rituraj Joshi; Avraham Sheinin; Zhiming Chen; Biswajit Mishra; Sunil Kumar Sahu; Neha Sengupta; Natalia Vassilieva; Joel Hestness; Samujjwal Ghosh; Bokang Jia; Onkar Arun Pandit; Satheesh Katipomu; Samta Kamboj; Rahul Pal; Parvez Mullah; Soundar Balaji Doraiswamy; Karim Chami; Preslav Nakov

Published: 03 Jul 2024, Last Modified: 19 Jul 2024. Venue: ICML 2024 FM-Wild Workshop (Poster). License: CC BY 4.0

Keywords: Large Language Model, Multilingual, Language Adaptation, Arabic

TL;DR: A recipe for adapting a monolingual LLM to another language, addressing catastrophic forgetting and tokenizer limitations with novel approaches to vocabulary extension and embedding initialization.

Abstract: We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing the challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embedding matrix, followed by full-model continual pre-training on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach yields significant improvements in Arabic and slight improvements in English, demonstrating cost-effective cross-lingual transfer. We perform ablations on embedding initialization techniques, data mix ratios, and learning rates, and release a detailed training recipe. To demonstrate the generalizability of this approach, we also adapt Llama 3 8B to Arabic and Llama 2 13B to Hindi.
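
The two-stage schedule described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the mean-of-existing-rows initialization is only one simple option (the abstract says several initialization techniques were ablated but does not name them here), and the function names and the parameter names `embed_tokens`, `layers.0.attn`, and `lm_head` are hypothetical. Whether the output projection is also trained in the first stage is not stated in the abstract; the sketch assumes only the input embeddings are.

```python
from statistics import fmean


def extend_embeddings(embeddings, num_new_tokens):
    """Append one row per new token, initialized to the mean of the existing rows.

    Mean initialization is an assumption made for illustration only.
    """
    dim = len(embeddings[0])
    mean_row = [fmean(row[j] for row in embeddings) for j in range(dim)]
    extended = [list(row) for row in embeddings]  # copy, leave the input untouched
    extended += [list(mean_row) for _ in range(num_new_tokens)]
    return extended


def trainable_parameters(param_names, stage):
    """Return the parameter names that receive updates in a given stage.

    Stage 1: only the (hypothetically named) embedding parameters.
    Stage 2: every parameter, i.e. full-model continual pre-training.
    """
    if stage == 1:
        return [name for name in param_names if name.startswith("embed")]
    if stage == 2:
        return list(param_names)
    raise ValueError("stage must be 1 or 2")


# Toy usage: a 2-token vocabulary with 2-dimensional embeddings, extended by 2 tokens.
old_embeddings = [[1.0, 2.0], [3.0, 4.0]]
new_embeddings = extend_embeddings(old_embeddings, num_new_tokens=2)

params = ["embed_tokens", "layers.0.attn", "lm_head"]
stage1 = trainable_parameters(params, stage=1)
stage2 = trainable_parameters(params, stage=2)
```

In a real framework the same idea would be expressed by resizing the model's embedding layer and freezing all other weights during the first stage, then unfreezing everything for the bilingual continual pre-training stage.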

Submission Number: 104
