Abstract: In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation) or selecting a sequence of merge operations (bottom-up tokenisation).
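As a rough illustration of the bottom-up variant, the Python sketch below (the names `apply_merge` and `compressed_size` are hypothetical, not from the paper) applies an ordered sequence of merge operations to a toy dataset and counts the symbols that remain; the decision problem asks whether some merge sequence brings this count to at most $\delta$.

```python
# Illustrative sketch (assumed formulation, not the paper's code): bottom-up
# tokenisation applies an ordered list of merge operations to a dataset and
# asks whether the result uses at most delta symbols.

def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def compressed_size(dataset, merges):
    """Total symbol count after applying the merges, in order, to every sequence."""
    total = 0
    for seq in dataset:
        for pair in merges:
            seq = apply_merge(seq, pair)
        total += len(seq)
    return total

dataset = [list("abab"), list("abc")]
merges = [("a", "b")]                     # one merge: a, b -> ab
print(compressed_size(dataset, merges))   # 4 symbols: [ab, ab] and [ab, c]
```

The NP-completeness result concerns choosing such a merge sequence (or, in the direct variant, a vocabulary) optimally; applying a fixed sequence, as above, is straightforward.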
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Pre-training, subword representations, vocabulary learning
Contribution Types: Theory
Languages Studied: N/A
Submission Number: 177