Keywords: tokenization, evaluation
Abstract: Language models (LMs) are trained over sequences of tokens, whereas users interface with LMs via text. When a user (unknowingly) ends their prompt in the *middle* of the expected next token, the predicted next-token distribution becomes distorted. While this phenomenon has been extensively documented in prior work using arbitrary character prefixes, less attention has been paid to how often it occurs in realistic prompts that adhere to word boundaries, or whether the distortion persists in these cases. In this work, we identify three domains where token boundaries commonly do not line up with semantic or syntactic ones: languages that do not use whitespace, highly compounding languages, and code. For instance, we find that in Chinese text, up to 25% of word boundaries do not line up with any token boundary, meaning that even prompts ending with complete words are susceptible to probability distortion. We then systematically construct semantically natural prompts that end with a partial token and measure the effect on predictions. We find that these constructions constitute a serious failure mode: frontier LMs consistently place two orders of magnitude less probability on the correct continuation compared to when the prompt is "backed off" to be token-aligned, despite being given strictly more context. Moreover, this phenomenon exhibits inverse scaling, with probability distortion increasing for larger models. Finally, we evaluate $\texttt{ByteSampler}$, a recently proposed sampling-time fix for the tokenization boundary problem, and find that it effectively and efficiently overcomes the problem, exceeding the performance of heuristic token backoff. Overall, we demonstrate the scale and severity of probability distortion caused by tokenization in realistic use cases, and recommend that model inference providers adopt an inference-time fix by default at every prompt boundary.
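The boundary mismatch the abstract describes can be illustrated with a toy greedy longest-match tokenizer (a simplification of BPE; the vocabulary below is hypothetical, not from any real model): truncating a prompt mid-token yields a token sequence that is *not* a prefix of the full text's tokenization, so the model conditions on token sequences it rarely saw in training.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization (a simplification of real BPE)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible match starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

# Hypothetical vocabulary: multi-character merges plus single-character fallbacks.
VOCAB = {"hello", " world", "wor", " ", "h", "e", "l", "o", "w", "r", "d"}

full = tokenize("hello world", VOCAB)   # -> ['hello', ' world']
partial = tokenize("hello wor", VOCAB)  # -> ['hello', ' ', 'wor']

# The truncated prompt's tokens diverge from the full text's tokens after
# 'hello', so the model never sees the usual spelling of ' world'.
print(full, partial)
```

Under this toy vocabulary, the prompt `"hello wor"` is encoded as `['hello', ' ', 'wor']`, while the intended continuation would require the single token `' world'`; the mismatch at the boundary is what distorts the next-token distribution.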
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24091