One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: tokenizer, multilingual pre-training, less-resourced languages, cross-lingual transfer, multilingual representations, tokenization
Abstract: Pretraining massively multilingual Large Language Models (LLMs) on corpora from many languages at once is challenging due to limited model capacity, scarce high-quality data in many languages, and compute constraints, which has led to a gap in language coverage. Moreover, when the tokenizer itself lacks coverage of a language, this gap becomes harder to close purely at the post-training stage. In this work, we study which relatively cheap interventions early in training improve "language plasticity", i.e., the model's ability to adapt to new languages after training. We focus on tokenizer design and propose a universal tokenizer trained on a broader set of languages than the primary pretraining languages, enabling nimble and efficient expansion of language coverage after pretraining. We conduct systematic experiments across 63 languages spanning diverse typological and lexicographic language groups. Across different training strategies, we show that models with a universal tokenizer adapt significantly better, with up to a 20.2% increase in win rates compared to models with a more conservative, language-group-specific tokenizer. Furthermore, a universal tokenizer also yields better plasticity towards languages that are completely unseen by both the tokenizer and pretraining, with win rate gains of up to 5%. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.
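The abstract does not specify how the universal tokenizer is constructed; the sketch below is a minimal illustration, assuming a byte-level BPE tokenizer trained with the HuggingFace tokenizers library on text that covers both the primary pretraining languages and the expanded language set. The file paths, language codes, and vocabulary size are hypothetical placeholders, not the authors' actual setup.

```python
# Hypothetical sketch: training a "universal" byte-level BPE tokenizer on a
# language set broader than the primary pretraining languages.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Primary pretraining languages plus additional languages reserved for later
# adaptation (language codes and data paths are placeholders).
primary_langs = ["en", "fr", "de", "zh", "ar"]
expansion_langs = ["sw", "yo", "hi", "bn", "fi"]
files = [f"data/{lang}.txt" for lang in primary_langs + expansion_langs]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=250_000,  # large vocabulary to cover many scripts (assumed value)
    special_tokens=["<pad>", "<bos>", "<eos>", "<unk>"],
)

# The tokenizer is trained once over the full language set and then kept fixed,
# so languages added after pretraining already segment into reasonable subwords.
tokenizer.train(files, trainer)
tokenizer.save("universal_tokenizer.json")
```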
Primary Area: foundation or frontier models, including LLMs
Submission Number: 11453