Tokenizer-Aware Cross-Lingual Adaptation of Decoder-Only LLMs through Embedding Relearning and Swapping
Abstract: Extending Large Language Models (LLMs) to support new languages is a challenging problem; most proposed methods suffer from high computational cost and catastrophic forgetting of the original model's capabilities.
Embedding relearning~\citep{artetxe-etal-2020-cross}, a technique that creates a new tokenizer and retrains the embeddings on top of frozen model weights for target-language adaptation, is both lightweight and performant. However, it has only been demonstrated to work for older-generation encoder-only models and for high-resource languages.
In this paper, we extend this framework to decoder-only LLMs, focusing on joint adaptation to many languages, including low-resource ones. We experiment with three language groups of over 100 languages each. Our approach adapts a pre-trained model by switching to a customized tokenizer and relearning the embedding layer.
Across 96 diverse languages spanning both classification and generation tasks, we demonstrate that embedding relearning improves $\texttt{Gemma2}$ models (up to 27B parameters) by up to 20\%, proving more effective than, or on par with, full-weight-update baselines while effectively mitigating English forgetting (1-3\% regressions).
Analysis reveals the critical role of customizing tokenizers in achieving effective language transfer, particularly for non-Latin script languages.
We further show that embedding relearning helps transfer reasoning abilities across languages, achieving a 14\% improvement over a math-optimized LLM across 20 languages.
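To make the recipe concrete, the sketch below illustrates the tokenizer swap and embedding relearning described in the abstract on a Hugging Face decoder-only checkpoint: the transformer weights stay frozen while a freshly initialized embedding table, sized to a customized tokenizer, is trained on target-language text. The checkpoint name, tokenizer path, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of tokenizer swapping + embedding relearning.
# Assumptions: a Hugging Face decoder-only LM with tied input/output
# embeddings; model name, tokenizer path, and learning rate are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "google/gemma-2-2b"                  # assumed base checkpoint
custom_tok = "path/to/target-language-tokenizer"  # tokenizer trained on target languages

model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(custom_tok)

# 1) Swap to the customized tokenizer: replace the embedding table with a
#    freshly initialized one sized to the new vocabulary, then re-tie the
#    output head to it.
model.set_input_embeddings(nn.Embedding(len(tokenizer), model.config.hidden_size))
model.tie_weights()

# 2) Freeze all transformer weights; only the embedding layer is relearned.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True

# 3) Relearn the embeddings with next-token prediction on target-language text.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
model.train()
for text in ["<target-language sentence>"]:  # placeholder corpus
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because only the embedding parameters receive gradients, the frozen transformer body retains its original capabilities, which is consistent with the small English regressions reported above.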
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: cross-lingual transfer, multilingual pre-training, less-resourced languages
Contribution Types: NLP engineering experiment
Languages Studied: 212 languages from South East Asia, 392 languages from Africa, and 170 Indic languages.
Submission Number: 2239