Keywords: Tokenization, Multilingual Language Models, Vocabulary Design, Data Mixture Algorithms, Indic Languages
Abstract: While model architecture and training objectives are well studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, which waste context length and slow inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which pose unique challenges due to their high script diversity and orthographic complexity.
Drawing on insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. The pre-tokenization strategies we identify significantly improve model performance, and our data-composition algorithm reduces the average token-to-word ratio by approximately 6% relative to the conventional data-randomization approach. Our tokenizer improves the average token-to-word ratio by more than 40% over state-of-the-art multilingual Indic models, yielding measurable gains in both model performance and inference speed. These results highlight tokenization, alongside architecture and training objectives, as a critical lever for building efficient, scalable multilingual LLMs.
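The token-to-word ratio the abstract reports can be computed in a few lines. The sketch below is illustrative, not the paper's evaluation code; it assumes whitespace-delimited words (a simplification for some Indic scripts) and accepts any tokenization callable, e.g. a Hugging Face tokenizer's `tokenize` method (the model name in the comment is hypothetical).

```python
from typing import Callable, Iterable, List

def token_to_word_ratio(texts: Iterable[str],
                        tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-delimited word.

    Lower is better: a ratio near 1.0 means most words survive as a
    single token, while high ratios indicate heavy fragmentation.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        words = text.split()
        if not words:
            continue
        total_words += len(words)
        total_tokens += len(tokenize(text))
    return total_tokens / total_words if total_words else float("nan")

# Example with a Hugging Face tokenizer (hypothetical model name):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("some-multilingual-model")
# ratio = token_to_word_ratio(corpus_lines, tok.tokenize)
```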
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13270