Keywords: Tokenisation, tokenization, language modelling, compression, LLM, NLP
TL;DR: We prove that selecting a tokeniser which maximises a dataset's compression is NP-complete and does not admit a PTAS (unless P = NP), even when inputs are defined over a binary alphabet.
Abstract: Recent works have proven tokenisation to be NP-complete.
However, the constructions in these proofs rely on tokenisation being applied to inputs over alphabets of unbounded cardinality, which does not reflect how tokenisers are used in practice.
Indeed, since practical applications of tokenisers involve fixed-size alphabets (e.g., Unicode or bytes), the practical implications of these results may be questioned.
In this work, we examine the computational complexity of tokenisation over bounded alphabets, considering two variants of the problem: bottom-up tokenisation, where we must select a sequence of merge operations, and direct tokenisation, where we must select a vocabulary, such that applying the chosen solution compresses a dataset to at most $\delta$ symbols.
When alphabets are bounded to have only 2 characters, we not only prove that bottom-up and direct tokenisation are NP-complete, but also that neither problem admits a polynomial-time approximation scheme (unless P = NP).
Furthermore, even when alphabets are bounded to contain a single character, we can still prove the NP-completeness of direct tokenisation.
Although the single-character case is not practical on its own, proving hardness results for an $n$-ary alphabet allows us to prove the same results for alphabets of any larger size, since any instance over an $n$-ary alphabet is also an instance over any alphabet that contains it.
We thus conclude that direct tokenisation over any alphabet is NP-complete, and that neither bottom-up nor direct tokenisation admits a polynomial-time approximation scheme for any alphabet of size 2 or larger.
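To make the two objectives from the abstract concrete, here is a minimal sketch (illustrative only, not the paper's formal construction) of how a candidate solution would be scored in each variant over a binary alphabet: a merge sequence is applied left to right in bottom-up tokenisation, while a vocabulary induces a shortest exact segmentation in direct tokenisation. All names below (`apply_merges`, `min_tokens`, `delta`) are our own and only assumed for illustration.

```python
# A minimal sketch (illustrative, not the paper's construction) of scoring a
# candidate solution in each problem variant, over the binary alphabet {"0", "1"}.

def apply_merges(doc, merges):
    """Bottom-up tokenisation: apply each merge rule in order, left to right.
    A document is a list of symbols; a merged pair becomes one new symbol."""
    for pair in merges:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) == pair:
                out.append(pair)  # the pair itself serves as the new symbol
                i += 2
            else:
                out.append(doc[i])
                i += 1
        doc = out
    return doc

def min_tokens(doc, vocab):
    """Direct tokenisation: fewest vocabulary tokens that exactly cover `doc` (a string)."""
    INF = float("inf")
    best = [0] + [INF] * len(doc)
    for i in range(1, len(doc) + 1):
        for tok in vocab:
            j = i - len(tok)
            if j >= 0 and doc[j:i] == tok and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
    return best[-1]

# Toy dataset of two binary documents and a compression budget delta.
dataset = ["0101", "0011"]
delta = 5

# Bottom-up: a single merge ("0", "1") compresses the 8 input symbols to 5.
merges = [("0", "1")]
bottom_up_len = sum(len(apply_merges(list(d), merges)) for d in dataset)
print(bottom_up_len, bottom_up_len <= delta)  # 5 True

# Direct: the vocabulary {"0", "1", "01"} also yields 5 tokens in total.
vocab = {"0", "1", "01"}
direct_len = sum(min_tokens(d, vocab) for d in dataset)
print(direct_len, direct_len <= delta)        # 5 True
```

The hardness results concern choosing the merges or the vocabulary so that the resulting total length is at most $\delta$; evaluating a fixed candidate, as above, is easy.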
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18828