Language Models over Canonical Byte-Pair Encodings

Published: 01 May 2025 · Last Modified: 19 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Tokenized language models place nontrivial probability on invalid, noncanonical encodings; fixing these mistakes with our new methods helps!
Abstract: Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of *noncanonical* token encodings of each character string—these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
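To give a rough sense of the first approach, the sketch below enforces canonicality at decoding time by masking any next token whose addition would make the prefix noncanonical and renormalizing the surviving probability mass. It is a minimal illustration, not the paper's algorithm: `next_token_probs` is a hypothetical hook into the underlying model, the tokenizer is assumed to expose Hugging Face-style `encode`/`decode`, and the canonicality test shown is the naive re-encode-and-compare baseline rather than the paper's efficient check.

```python
# Minimal sketch of "canonicality by conditioning" at decoding time.
# Assumptions (not from the paper): `tokenizer` exposes encode/decode
# (e.g., a Hugging Face tokenizer) and `next_token_probs(prefix_ids)`
# returns a dict {token_id: prob} from the underlying model.

def is_canonical(token_ids, tokenizer):
    """Naive check: a token string is canonical iff re-encoding its
    decoded text reproduces exactly the same token ids."""
    text = tokenizer.decode(token_ids)
    return tokenizer.encode(text, add_special_tokens=False) == list(token_ids)

def canonical_step(prefix_ids, next_token_probs, tokenizer):
    """Zero out next tokens that would make the prefix noncanonical,
    then renormalize the remaining probability mass."""
    probs = next_token_probs(prefix_ids)  # {token_id: p(token | prefix)}
    kept = {t: p for t, p in probs.items()
            if is_canonical(prefix_ids + [t], tokenizer)}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("no canonical continuation with nonzero probability")
    return {t: p / total for t, p in kept.items()}
```

Re-encoding the full prefix at every step is wasteful; the paper's contribution includes a more efficient method for checking canonicality, which this baseline only stands in for.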
Lay Summary: Large language models represent text using a process called byte-pair encoding (BPE); under this scheme, text is broken up into chunks called tokens, which allows the model to generate text more efficiently. However, this process introduces a subtle but important problem: although each string of text has one *correct* or *canonical* tokenization under BPE, current models can mistakenly assign probability to many invalid, noncanonical versions, i.e., strings of tokens that can never occur in real BPE-encoded text. This misallocation of probability wastes modeling power and may even distort the model's understanding of language through noncanonical hallucinations. Our paper identifies and addresses this issue. We introduce two methods to ensure that language models only assign probability to valid, canonical token strings. Our first method adjusts the model at inference time—without retraining—to only consider canonical strings. Our second method builds the canonical constraint directly into the model's architecture and can be fine-tuned for better performance. We prove that both approaches improve model accuracy in theory, and we validate these improvements empirically across multiple popular models and benchmark datasets. We also introduce a new, efficient method for checking whether a token string is canonical, making our solutions practical and easy to implement. Ultimately, this work helps language models align more closely with the data on which they were trained, thereby improving reliability and reducing errors in generated text.
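To make the notion of a noncanonical encoding concrete, the snippet below contrasts the canonical BPE encoding of a string with an alternative token string that decodes to the same characters. It uses the GPT-2 tokenizer from the `transformers` library as a stand-in; the particular splits are an illustrative assumption, not an example taken from the paper.

```python
# Canonical vs. noncanonical encodings, illustrated with the GPT-2 BPE
# tokenizer from Hugging Face `transformers` (an assumed stand-in model).
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = " hello world"
canonical_ids = tokenizer.encode(text)  # the unique encoding BPE would produce
print(canonical_ids, tokenizer.convert_ids_to_tokens(canonical_ids))

# Build a different token string that decodes to the same characters,
# e.g., by tokenizing each character separately.
noncanonical_ids = [tid for ch in text for tid in tokenizer.encode(ch)]
assert tokenizer.decode(noncanonical_ids) == text  # same characters...
assert noncanonical_ids != canonical_ids           # ...different token string

# A token-level LM that assigns positive probability to `noncanonical_ids`
# wastes mass: deterministic BPE never produces this encoding in training data.
```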
Primary Area: Deep Learning->Large Language Models
Keywords: tokenization, probabilistic inference, structured prediction, constrained generation
Link To Code: https://github.com/genlm/canonical-icml-2025
Submission Number: 13011