Tokenisation is NP-Complete

ICML 2025 Workshop TokShop Submission 12

Published: 10 Jun 2025, Last Modified: 11 Jun 2025
License: CC BY 4.0
Archiving Submission: No (non-archival)
Previous Venue If Non Archival: ACL 2025
Keywords: Tokenisation, Computational Complexity, Compression
TL;DR: We prove the NP-completeness of two variants of tokenisation.
Abstract: In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (_direct_ tokenisation), or selecting a sequence of merge operations (_bottom-up_ tokenisation).
Submission Number: 12
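
To make the bottom-up variant in the abstract concrete, the sketch below checks whether a given sequence of merges compresses a dataset to at most $\delta$ symbols. It is only an illustration under stated assumptions, not the paper's construction: the function names `apply_merges` and `compresses_to` are hypothetical, merges are assumed to be applied in order and left to right (as in BPE-style tokenisers), and the dataset is assumed to be a list of strings. Checking a candidate solution like this takes polynomial time, which is the kind of verification that membership in NP requires; the hard part is finding a good merge sequence.

```python
def apply_merges(text, merges):
    """Apply each merge (a, b) in order, scanning left to right.

    Assumption: every adjacent pair a, b is replaced by the single symbol a + b,
    as in BPE-style tokenisers.
    """
    symbols = list(text)  # start from individual characters
    for a, b in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # replace the pair with one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def compresses_to(dataset, merges, delta):
    """Return True if the merges compress the whole dataset to at most delta symbols."""
    total = sum(len(apply_merges(text, merges)) for text in dataset)
    return total <= delta


# Hypothetical toy dataset and merge sequence, chosen only for illustration.
dataset = ["abab", "abc"]
merges = [("a", "b"), ("ab", "ab")]
print(compresses_to(dataset, merges, delta=3))
```

Here "abab" becomes the single symbol "abab" and "abc" becomes the two symbols "ab" and "c", so the dataset compresses to three symbols in total.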