Generating output diversity from prompt re-tokenization

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: tokenization, test-time scaling
TL;DR: A novel approach for test-time scaling using stochastic tokenization
Abstract: Large language models (LLMs) process tokens, not characters, and in principle do not have access to subtoken structure. This is often seen as a fundamental flaw of tokenization, and has been blamed for many problematic behaviors on character-level tasks. However, we present evidence that trained LLMs do in fact learn subtoken structure, and that it can be leveraged in a novel sampling strategy which we call ${\it tokenization\ sampling}$, whereby different token encodings of the same surface string can lead to different completions. Through experiments on benchmarks in knowledge retrieval, mathematical reasoning, and code generation, we observe that while such retokenization tends to make easy problems slightly harder, it also allows LLMs to solve otherwise impossible problems that conventional temperature sampling would ${\it never}$ get correct. Thus, the redundancy of tokenizations of a given string offers a method of test-time data augmentation that can generate multiple views of the same prompt. In sum, our work presents an unsung advantage of tokenized language modeling: a mechanism for generating a diversity of outputs from prompts that are identical at the byte level.
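To make the idea concrete, below is a minimal sketch of what tokenization sampling could look like in practice: the same prompt string is encoded into several distinct but equivalent token sequences, and one completion is sampled from each. This assumes a HuggingFace byte-level BPE tokenizer (e.g. GPT-2's) so that independently encoded pieces decode back to the original string; the random-split retokenization heuristic, model name, and helper `retokenize` are illustrative assumptions, not the paper's actual procedure.

```python
# Sketch of "tokenization sampling": generate non-canonical token encodings of
# one prompt and sample a completion from each encoding.
# Assumption: a byte-level BPE tokenizer (e.g. GPT-2), where concatenating the
# encodings of substrings still decodes to the original surface string.
import random
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def retokenize(tokenizer, text, n_splits=3, seed=None):
    """Encode `text` as a non-canonical token sequence by cutting the string
    at random character positions and encoding each piece independently."""
    rng = random.Random(seed)
    cuts = sorted(rng.sample(range(1, len(text)), k=min(n_splits, max(len(text) - 1, 0))))
    pieces = [text[i:j] for i, j in zip([0] + cuts, cuts + [len(text)])]
    ids = []
    for piece in pieces:
        ids.extend(tokenizer.encode(piece, add_special_tokens=False))
    return ids

if __name__ == "__main__":
    name = "gpt2"  # placeholder model; any causal LM with a byte-level BPE works
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    prompt = "How many r's are in the word strawberry?"

    for seed in range(3):
        ids = retokenize(tokenizer, prompt, seed=seed)
        # Same bytes, different token sequence (holds for byte-level BPE).
        assert tokenizer.decode(ids) == prompt
        out = model.generate(torch.tensor([ids]), max_new_tokens=30,
                             do_sample=True, top_p=0.95,
                             pad_token_id=tokenizer.eos_token_id)
        print(f"[seed {seed}] {len(ids)} tokens ->",
              tokenizer.decode(out[0][len(ids):]))
```

Each run feeds the model a token sequence that decodes to the identical prompt, so any variation across completions beyond ordinary sampling noise comes from the tokenization itself, which is the "multiple views" effect the abstract describes.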
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 96