Tokenisation is NP-Complete

Published: 01 Jan 2025 · Last Modified: 07 Oct 2025 · ACL (1) 2025 · CC BY-SA 4.0
Abstract: In this work, we prove the NP-completeness of two variants of tokenisation, defined here as the problem of compressing a dataset to at most 𝛿 symbols by either finding a vocabulary directly (_direct_ tokenisation) or selecting a sequence of merge operations (_bottom-up_ tokenisation).
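The two variants can be illustrated with a minimal sketch. This is not the paper's formal construction: the function names, the greedy longest-match heuristic for the direct variant, and the inputs are all illustrative assumptions. In the decision problems the paper studies, one asks whether some vocabulary (direct) or some merge sequence (bottom-up) compresses the dataset to at most 𝛿 symbols; the snippet below only shows how a *given* vocabulary or merge sequence is applied.

```python
def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bottom_up_tokenise(text, merges):
    """Bottom-up variant: start from characters, apply the merge sequence in order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

def direct_tokenise(text, vocab):
    """Direct variant (one possible heuristic): greedy longest match against a vocabulary."""
    out, i = [], 0
    while i < len(text):
        # Fall back to a single character if no vocabulary item matches.
        match = next((text[i:i + k] for k in range(len(text) - i, 0, -1)
                      if text[i:i + k] in vocab), text[i])
        out.append(match)
        i += len(match)
    return out
```

For example, `bottom_up_tokenise("abab", [("a", "b")])` and `direct_tokenise("abab", {"ab"})` both compress the four-character dataset to two symbols; the NP-hardness lies in choosing the merges or vocabulary, not in applying them.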