Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Brian Siyuan Zheng; Alisa Liu; Orevaoghene Ahia; Jonathan Hayase; Yejin Choi; Noah A. Smith

Published: 18 Sept 2025, Last Modified: 02 Feb 2026. Venue: NeurIPS 2025 spotlight. License: CC BY 4.0

Keywords: tokenization, language models, robustness

TL;DR: Language models are surprisingly robust to non-canonical tokenizations of the input, which can even lead to improved performance

Abstract: Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the language model vocabulary, including tokenizing by character. In this paper, we investigate the robustness of LMs to input encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We find that overall stronger models tend to be more robust, and that robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to 15%, and right-aligned digit grouping enhances large-number arithmetic by over 33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We provide evidence that both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings). However, base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less committed to their tokenizer than previously believed, and highlight the promise of intervening on tokenization at inference time to boost language model performance.
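To make the three non-canonical schemes named in the abstract concrete, the sketch below segments plain strings rather than real token IDs. It is an illustration under stated assumptions, not the authors' implementation: the toy vocabulary, the function names, the group size of 3, and the sampling procedure (which is not uniform over all valid segmentations) are all hypothetical.

```python
import random


def char_level(text):
    """Character-level segmentation: every character is its own piece."""
    return list(text)


def random_segmentation(text, vocab, rng):
    """Sample one segmentation of `text` left to right.

    At each position, pick at random among the single next character and
    any longer substring found in `vocab`. This is a simple illustrative
    sampler; it is NOT uniform over all valid segmentations.
    """
    pieces = []
    i = 0
    while i < len(text):
        ends = [j for j in range(i + 1, len(text) + 1)
                if j == i + 1 or text[i:j] in vocab]
        j = rng.choice(ends)
        pieces.append(text[i:j])
        i = j
    return pieces


def right_aligned_digit_groups(digits, size=3):
    """Group digits from the right, so any short group comes first.

    The group size of 3 is an assumption made for this sketch.
    """
    first = len(digits) % size
    groups = [digits[:first]] if first else []
    groups += [digits[k:k + size] for k in range(first, len(digits), size)]
    return groups


# Hypothetical toy vocabulary of multi-character pieces.
toy_vocab = {"to", "ken", "token", "iz", "ize", "er", "tokenizer"}

print(char_level("token"))
print(random_segmentation("tokenizer", toy_vocab, random.Random(0)))
print(right_aligned_digit_groups("1234567"))  # ['1', '234', '567']
```

Every segmentation concatenates back to the original string, which is what makes these alternative encodings of the same input. With a real model, each piece would additionally need to be mapped to a token ID in that model's vocabulary.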

Supplementary Material: zip

Primary Area: Evaluation (e.g., methodology, meta studies, replicability and validity, human-in-the-loop)

Submission Number: 25369
