Keywords: tokenization, language models, robustness
TL;DR: Language models are surprisingly robust to non-canonical tokenizations of the input, and such tokenizations can even improve performance.
Abstract: Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the language model vocabulary, including tokenizing by character. In this paper, we investigate the robustness of LMs to input encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We find that overall stronger models tend to be more robust, and that robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to 15%, and right-aligned digit grouping enhances large-number arithmetic by over 33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We provide evidence that both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings). However, base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less committed to their tokenizer than previously believed, and highlight the promise of intervening on tokenization at inference time to boost language model performance.
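To make the three tokenization schemes named in the abstract concrete, the following is a minimal sketch at the level of string pieces. It is not the authors' implementation: the function names, the toy vocabulary, and the left-to-right sampling procedure are all assumptions made for illustration, and the abstract does not specify how random tokenizations are sampled. The sketch also assumes every single character of the input is itself in the vocabulary, so that a segmentation always exists.

```python
import random

def char_level(text):
    """Character-level segmentation: one piece per character."""
    return list(text)

def right_aligned_digits(number, group=3):
    """Split a digit string into groups of `group` digits, aligned from the right.

    The group size of 3 is an assumption for illustration.
    """
    head = len(number) % group  # leftover digits that form the leading, shorter group
    pieces = [number[:head]] if head else []
    pieces += [number[i:i + group] for i in range(head, len(number), group)]
    return pieces

def random_segmentation(text, vocab, rng):
    """Sample one segmentation of `text` into pieces drawn from `vocab`.

    Hypothetical sampler: at each position, pick uniformly among all vocabulary
    pieces that match there. This is not uniform over whole segmentations.
    Assumes every single character of `text` is in `vocab`.
    """
    pieces, i = [], 0
    while i < len(text):
        options = [text[i:j] for j in range(i + 1, len(text) + 1) if text[i:j] in vocab]
        piece = rng.choice(options)
        pieces.append(piece)
        i += len(piece)
    return pieces

# Toy vocabulary (hypothetical): all single characters plus a few multi-character pieces.
TOY_VOCAB = set("tokenization") | {"token", "ization", "iz", "ation", "en"}
```

For example, `right_aligned_digits("1234567")` returns `["1", "234", "567"]`, and `random_segmentation("tokenization", TOY_VOCAB, random.Random(0))` returns some list of vocabulary pieces that concatenates back to the original string. Turning pieces into token IDs depends on the specific tokenizer and is left out here.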
Supplementary Material:  zip
Primary Area: Evaluation (e.g., methodology, meta studies, replicability and validity, human-in-the-loop)
Submission Number: 25369