Keywords: Tokenization, Multilingual Language Models, Vocabulary Design, Data Mixture Algorithms, Indic Languages
Abstract: While model architecture and training objectives are well studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, which waste context length and slow inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which pose unique challenges due to their high script diversity and orthographic complexity.
Drawing on insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. The pre-tokenization strategies we identify significantly improve model performance, and our data-composition algorithm reduces the average token-to-word ratio by approximately 6% relative to the conventional data-randomization approach. Our tokenizer improves the average token-to-word ratio by more than 40% over state-of-the-art multilingual Indic models, yielding measurable gains in both model performance and inference speed. These results highlight tokenization, alongside architecture and training objectives, as a critical lever for building efficient, scalable multilingual LLMs.
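The token-to-word ratio the abstract reports can be computed in a few lines. The sketch below is illustrative, not the paper's evaluation code; it assumes whitespace-delimited words (a simplification for some Indic scripts) and accepts any tokenization callable, e.g. a Hugging Face tokenizer's `tokenize` method (the model name in the comment is hypothetical).

```python
from typing import Callable, Iterable, List

def token_to_word_ratio(texts: Iterable[str],
                        tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-delimited word.

    Lower is better: a ratio near 1.0 means most words survive as a
    single token, while high ratios indicate heavy fragmentation.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        words = text.split()
        if not words:
            continue
        total_words += len(words)
        total_tokens += len(tokenize(text))
    return total_tokens / total_words if total_words else float("nan")

# Example with a Hugging Face tokenizer (hypothetical model name):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("some-multilingual-model")
# ratio = token_to_word_ratio(corpus_lines, tok.tokenize)
```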
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13270