Keywords: tokenization, language models, multilinguality
TL;DR: We train fourteen models that are identical except for their tokenizers and evaluate tokenization effects on a custom multilingual benchmark.
Abstract: Tokenizers provide the fundamental interface through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood because it is difficult to measure the impact of tokenization in isolation. To address this challenge, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical, sharing the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance under real-world perturbations likely to influence tokenization. Together, TokSuite robustly decouples the influence of a model's tokenizer from other factors, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14842