Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: language models, pretraining, tokenization
TL;DR: We show that a simple stochastic tokenization method—randomly splitting tokens before pretraining—dramatically improves subword-level understanding in language models without any compromise to benchmark performance or increase in training cost.
Abstract: Despite impressive performance, large language models (LLMs) still struggle with seemingly simple questions such as "How many r's are in 'strawberry'?" This limitation highlights that LLMs do not understand how humans "see" language. We attempt to address this by experimenting with stochastic tokenization schemes, in which the same text may be tokenized into multiple possible token sequences. We find that using stochastic tokenization during pretraining dramatically alters the learned representations, allowing LLMs to capture fine-grained, spelling-level detail in addition to the structure learned with standard tokenization. We demonstrate this by showing that LLMs pretrained with standard deterministic tokenization cannot be fine-tuned to answer language-game-style questions, whereas with the minimal addition of stochastic tokenization during pretraining, the corresponding LLMs perform near-perfectly. Crucially, these improvements come with no performance drop on standard benchmarks and no additional training cost: the only change is a single, simple, and computationally cheap preprocessing step. Overall, our results suggest that embracing stochastic tokenization can help LLMs better understand how humans perceive language.
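To make the idea concrete, the sketch below shows one possible way to realize stochastic tokenization as a preprocessing step: each token produced by a base tokenizer is, with some probability, cut at a random character boundary and the two halves are tokenized independently. This is a minimal illustration under assumed details; the function name, the split_prob parameter, and the toy whitespace tokenizer are hypothetical and are not the exact scheme used in the paper.

```python
import random

def stochastic_tokenize(text, tokenize, split_prob=0.1, rng=None):
    """Stochastically re-split tokens so the same text can map to
    multiple token sequences (illustrative sketch; split_prob and the
    splitting rule are assumptions, not the paper's exact method)."""
    rng = rng or random.Random()
    output = []
    for token in tokenize(text):
        # With probability split_prob, cut the token at a random
        # character boundary and tokenize each piece on its own.
        if len(token) > 1 and rng.random() < split_prob:
            cut = rng.randint(1, len(token) - 1)
            output.extend(tokenize(token[:cut]))
            output.extend(tokenize(token[cut:]))
        else:
            output.append(token)
    return output

if __name__ == "__main__":
    # Toy whitespace "tokenizer" stands in for a real subword tokenizer.
    base_tokenize = lambda s: s.split() if " " in s else [s]
    rng = random.Random(0)
    print(stochastic_tokenize("how many r s are in strawberry",
                              base_tokenize, split_prob=0.5, rng=rng))
```

Because the splitting is random, repeated passes over the same pretraining text expose the model to different segmentations of the same words, which is the property the abstract attributes to stochastic tokenization.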
Submission Number: 43