TL;DR: Efficient algorithms for character-level conditioning of LLMs
Abstract: Modern language models are internally—and mathematically—distributions over *token* strings rather than *character* strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before being passed to the token-level language model. Consequently, the tokenization and all downstream processing are highly sensitive to the exact specification of the prompt (e.g., whether or not it ends with a space). This paper presents algorithms for converting token-level language models into character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark their practical runtime and approximation quality. Across four publicly available language models, we find that—even with a small computation budget—our method accurately approximates the character-level distribution at reasonably fast speeds, and that it yields a significant improvement in the language model's compression rate (bits/byte).
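To make the conversion concrete, here is a minimal, brute-force sketch (not taken from the paper or the linked repository) of the character-level marginal the abstract refers to: the probability of a character string is the sum, over every token string that decodes to it, of that token string's probability under the token-level model. The toy vocabulary and the uniform placeholder `token_prob` are illustrative assumptions; the paper's exact and approximate algorithms avoid this exponential enumeration.

```python
# Toy token vocabulary (strings of characters). Purely illustrative.
VOCAB = ["a", "b", "ab", "ba"]

def token_prob(tokens):
    """Placeholder token-level LM: i.i.d. uniform over the toy vocabulary."""
    return (1.0 / len(VOCAB)) ** len(tokens)

def segmentations(s):
    """Yield every way of splitting the character string s into vocabulary tokens."""
    if s == "":
        yield []
        return
    for tok in VOCAB:
        if s.startswith(tok):
            for rest in segmentations(s[len(tok):]):
                yield [tok] + rest

def char_prob(s):
    """Character-level probability of s: marginal over all of its tokenizations."""
    return sum(token_prob(toks) for toks in segmentations(s))

print(list(segmentations("aba")))  # [['a', 'b', 'a'], ['a', 'ba'], ['ab', 'a']]
print(char_prob("aba"))            # 9/64: larger than any single tokenization's probability
```

A real token-level LM is autoregressive rather than i.i.d., but the marginalization structure is the same.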
Lay Summary: Internally, modern language models don’t represent text as characters or words—they use tokens, which are variable-length chunks like `super`, `cal`, `if`, `rag`, `il`, `ist`, `ice`, `xp`, `ial`, `id`, `ocious`. This setup makes models efficient to train and run, but introduces surprising and often frustrating behavior for users. For example, adding a single space to the end of a prompt can dramatically change the model’s output—even when the text looks the same to us.
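As an illustration of this sensitivity (assuming the Hugging Face `transformers` package and the publicly released GPT-2 tokenizer; this snippet is not part of the paper's method), a trailing space changes the token sequence the model actually conditions on:

```python
# Illustration only: a trailing space changes the token sequence the model sees.
# Assumes the Hugging Face `transformers` package and the public GPT-2 tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

without_space = tok.tokenize("The capital of France is")
with_space    = tok.tokenize("The capital of France is ")

print(without_space)  # the usual subword chunks
print(with_space)     # the same chunks plus an extra token for the trailing space
# The two prompts look identical to a human, but the model is conditioned on
# different token sequences, which can change its predictions substantially.
```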
This paper resolves the issue by transforming any existing token-based language model into a character-based one—without any retraining. The result is a model that behaves more predictably and intuitively when given character-level prompts.
But the benefits go deeper: token-based models actually assign probability mass to many different tokenizations of the same text, even though common interfaces only consider one. Our method accounts for all valid tokenizations, unlocking probability estimates that are both more accurate and more faithful to the underlying model!
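One way to see this concretely (again an illustration assuming the `transformers` package and the GPT-2 tokenizer, not the paper's implementation): the same character string can be reached by many different token ID sequences, yet standard interfaces only score the canonical one.

```python
# Illustration only: the same character string admits many token-level encodings.
# Assumes the Hugging Face `transformers` package and the public GPT-2 tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "hello world"
canonical = tok.encode(text)                                # the tokenizer's preferred encoding
per_char  = [tid for ch in text for tid in tok.encode(ch)]  # a valid but non-canonical encoding

assert tok.decode(canonical) == tok.decode(per_char) == text
print(canonical)  # a short token ID sequence
print(per_char)   # a longer ID sequence that decodes to exactly the same characters
# A token-level LM assigns probability to both sequences (and many others); the
# character-level view sums over all of them instead of picking just one.
```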
Link To Code: https://github.com/genlm/genlm-bytes
Primary Area: Deep Learning->Large Language Models
Keywords: tokenization, character, bytes, tokens, language models, probabilistic reasoning, probabilistic inference
Submission Number: 12880