Abstract: In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation) or selecting a sequence of merge operations (bottom-up tokenisation).
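As a rough illustration of the bottom-up variant, the Python sketch below (the names `apply_merge` and `compressed_size` are hypothetical, not from the paper) applies an ordered sequence of merge operations to a toy dataset and counts the symbols that remain; the decision problem asks whether some merge sequence brings this count to at most $\delta$.

```python
# Illustrative sketch (assumed formulation, not the paper's code): bottom-up
# tokenisation applies an ordered list of merge operations to a dataset and
# asks whether the result uses at most delta symbols.

def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def compressed_size(dataset, merges):
    """Total symbol count after applying the merges, in order, to every sequence."""
    total = 0
    for seq in dataset:
        for pair in merges:
            seq = apply_merge(seq, pair)
        total += len(seq)
    return total

dataset = [list("abab"), list("abc")]
merges = [("a", "b")]                     # one merge: a, b -> ab
print(compressed_size(dataset, merges))   # 4 symbols: [ab, ab] and [ab, c]
```

The NP-completeness result concerns choosing such a merge sequence (or, in the direct variant, a vocabulary) optimally; applying a fixed sequence, as above, is straightforward.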
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Pre-training, subword representations, vocabulary learning
Contribution Types: Theory
Languages Studied: N/A
Submission Number: 177