Archiving Submission: No (non-archival)
Keywords: tokenization, language models, robustness
TL;DR: Language models are surprisingly robust to non-canonical tokenizations of the input, which can even lead to improved performance
Abstract: Modern tokenizers employ deterministic algorithms to map text into a single ``canonical'' token sequence, yet the same string can be encoded as many non-canonical tokenizations using the language model vocabulary, including tokenizing by character.
In this paper, we investigate the robustness of LMs to input encoded with non-canonical tokenizations entirely unseen during training.
Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4\% of their original performance when given a randomly sampled tokenization, and 90.8\% with character-level tokenization.
We find that overall stronger models tend to be more robust, and that robustness diminishes as the tokenization departs farther from the canonical form.
Motivated by these results, we then identify settings where non-canonical tokenization schemes can $\textit{improve}$ performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to 15\%, and right-aligned digit grouping enhances large-number arithmetic by over 33\%.
Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase.
We show that while both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings), base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses.
Overall, our findings suggest that models are less tied to their tokenizer than previously believed, and highlight the promise of intervening on tokenization at inference time to boost performance.
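As a rough illustration of two segmentation schemes named in the abstract, the sketch below splits a string into single characters and groups the digits of a number from the right. It is not the authors' implementation: the function names are hypothetical, the group size of 3 is an assumption (the abstract does not state one), and the pieces are plain strings rather than token IDs, so a real experiment would still need to check that each piece exists in the model vocabulary.

```python
def char_level(text: str) -> list[str]:
    """Split text into one piece per character (character-level segmentation)."""
    return list(text)


def group_digits_right(number: str, size: int = 3) -> list[str]:
    """Group digits so that chunk boundaries align with the right end.

    The group size of 3 is an assumption made for illustration only.
    """
    if size < 1:
        raise ValueError("size must be positive")
    if not (number.isascii() and number.isdigit()):
        raise ValueError("expected a non-empty string of ASCII digits")
    # The leftmost chunk absorbs the remainder, so every later chunk is full.
    first = len(number) % size
    chunks = [number[:first]] if first else []
    chunks += [number[i:i + size] for i in range(first, len(number), size)]
    return chunks


def group_digits_left(number: str, size: int = 3) -> list[str]:
    """Left-aligned grouping, shown only for contrast."""
    return [number[i:i + size] for i in range(0, len(number), size)]


if __name__ == "__main__":
    print(char_level("cat"))
    print(group_digits_right("1234567"))
    print(group_digits_left("1234567"))
```

With right alignment, each chunk covers the same place values (ones to hundreds, thousands to hundred-thousands, and so on) regardless of the number's length, whereas left alignment shifts the boundaries whenever the digit count changes. Both functions preserve the original string when the pieces are concatenated.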
Submission Number: 25