Tokenisation over Bounded Alphabets is Hard

Published: 26 Jan 2026 · Last Modified: 11 Apr 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Tokenisation, tokenization, language modelling, compression, LLM, NLP
TL;DR: We prove that selecting a tokeniser which maximises a dataset's compression is NP-complete and does not admit a PTAS (unless P=NP), even when inputs are defined over a binary alphabet.
Abstract: Recent works have shown that tokenisation is $\mathsf{NP}$-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets—an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. We prove that even with binary alphabets, both variants are not only $\mathsf{NP}$-complete, but also $\mathsf{APX}$-hard and thus admit no polynomial-time approximation scheme (unless $\mathsf{P}=\mathsf{NP}$). We further show that direct tokenisation remains $\mathsf{NP}$-complete even when applied to unary alphabets. These results establish that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why current practical algorithms such as BPE and UnigramLM are heuristic, and point toward approximation algorithms being an important path going forward for tokenisation research.
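To make the bottom-up variant concrete, here is a minimal sketch of greedy BPE-style merging over a binary alphabet: each step merges the most frequent adjacent pair into a new symbol, shortening the token sequence. This is an illustrative heuristic (the kind the abstract calls heuristic), not the paper's construction; the function names and example string are assumptions for illustration.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent token pair, or None if no pairs exist."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(tokens, pair):
    """Replace non-overlapping left-to-right occurrences of `pair` with one merged symbol."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def greedy_bpe(text, num_merges):
    """Greedily select `num_merges` merge operations, BPE-style; returns the compressed token sequence."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = apply_merge(tokens, pair)
    return tokens

# A binary-alphabet input, as in the paper's hardness setting.
data = "0101010110101011"
compressed = greedy_bpe(data, 2)
print(len(data), "->", len(compressed))  # 16 -> 6
```

The paper's hardness results say that choosing the merge sequence that *optimally* compresses a dataset is NP-complete and APX-hard even on such binary inputs, so a greedy rule like this can be arbitrarily far from optimal in general.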
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18828