Tokenizer-Aware Cross-Lingual Adaptation of Decoder-Only LLMs through Embedding Relearning and Swapping
Abstract: Extending Large Language Models (LLMs) to support new languages is a challenging problem; most proposed methods suffer from high computational cost and catastrophic forgetting of the original model's capabilities.
Embedding relearning~\citep{artetxe-etal-2020-cross}, a technique that creates a new tokenizer and retrains the embeddings on top of frozen model weights for target-language adaptation, is both lightweight and performant. However, it has only been demonstrated to work for older-generation encoder-only models and for high-resource languages.
In this paper, we extend this framework to decoder-only LLMs, focusing on joint adaptation to many languages, including low-resource ones. We experiment with three language groups of over 100 languages each. Our approach adapts a pre-trained model by switching to a customized tokenizer and relearning the embedding layer.
Across 96 diverse languages spanning both classification and generation tasks, we demonstrate that embedding relearning improves $\texttt{Gemma2}$ models (up to 27B parameters) by up to 20\%, proving more effective than, or on par with, full-weight-update baselines while effectively mitigating English forgetting (1-3\% regressions).
Analysis reveals the critical role of customizing tokenizers in achieving effective language transfer, particularly for non-Latin script languages.
We further show that embedding relearning helps transfer reasoning abilities across languages, achieving a 14\% improvement over a math-optimized LLM across 20 languages.
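To make the recipe concrete, the sketch below illustrates the tokenizer swap and embedding relearning described in the abstract on a Hugging Face decoder-only checkpoint: the transformer weights stay frozen while a freshly initialized embedding table, sized to a customized tokenizer, is trained on target-language text. The checkpoint name, tokenizer path, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of tokenizer swapping + embedding relearning.
# Assumptions: a Hugging Face decoder-only LM with tied input/output
# embeddings; model name, tokenizer path, and learning rate are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "google/gemma-2-2b"                  # assumed base checkpoint
custom_tok = "path/to/target-language-tokenizer"  # tokenizer trained on target languages

model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(custom_tok)

# 1) Swap to the customized tokenizer: replace the embedding table with a
#    freshly initialized one sized to the new vocabulary, then re-tie the
#    output head to it.
model.set_input_embeddings(nn.Embedding(len(tokenizer), model.config.hidden_size))
model.tie_weights()

# 2) Freeze all transformer weights; only the embedding layer is relearned.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True

# 3) Relearn the embeddings with next-token prediction on target-language text.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
model.train()
for text in ["<target-language sentence>"]:  # placeholder corpus
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because only the embedding parameters receive gradients, the frozen transformer body retains its original capabilities, which is consistent with the small English regressions reported above.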
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: cross-lingual transfer, multilingual pre-training, less-resourced languages
Contribution Types: NLP engineering experiment
Languages Studied: 212 languages from South East Asia, 392 languages from Africa, and 170 Indic languages.
Submission Number: 2239