Keywords: language models, tokenization, automata, transducers
TL;DR: We present a method for converting a language model over one set of tokens into a language model over another set of tokens.
Abstract: Modern language models define distributions over strings, but their outputs are not always suited to downstream tasks.
For instance, a model generating byte-pair strings may not be suitable when word-level predictions are needed, and a DNA model may not fit applications requiring amino acids. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, they are not treated as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. Focusing on transformations representable as finite-state transducers (FSTs), a commonly used state-machine abstraction for efficient string-to-string mappings, we develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target. This allows us to propagate probabilities through the transducer without altering model parameters and to *condition* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting token-level language models to character-level language models, converting token-level language models to word-level models, and deriving amino-acid models from DNA models. These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
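To make the induced distribution concrete: for a deterministic map $f$, the probability of a target $y$ is the total mass of its preimage, $p_f(y) = \sum_{x : f(x) = y} p(x)$. The sketch below illustrates this on a toy finite distribution over single DNA codons mapped to amino acids. It is only an illustration of marginalizing over source strings, not the FST-based algorithm of the paper; the names `pushforward`, `translate`, and `p_dna`, the four-codon table (a small subset of the standard genetic code), and the probabilities are all assumptions chosen for the example.

```python
from collections import defaultdict
from fractions import Fraction


def pushforward(p, f):
    """Distribution of f(X) for X ~ p: sum p(x) over all x with f(x) = y."""
    q = defaultdict(Fraction)  # Fraction() is 0, so missing targets start at 0
    for x, px in p.items():
        q[f(x)] += px
    return dict(q)


# Hypothetical toy table: a small subset of the standard genetic code.
CODON_TO_AA = {"GCT": "A", "GCC": "A", "TGT": "C", "TGC": "C"}


def translate(dna):
    """Deterministic string-to-string map: read codons of length 3."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


# Hypothetical source "language model": a finite distribution over DNA strings.
p_dna = {
    "GCT": Fraction(1, 2),
    "GCC": Fraction(1, 4),
    "TGT": Fraction(1, 8),
    "TGC": Fraction(1, 8),
}

# Induced distribution over amino-acid strings.
p_aa = pushforward(p_dna, translate)
```

Here both `GCT` and `GCC` map to `A`, so their probabilities are summed. Enumerating the source distribution like this is only possible for a finite toy example; a real language model has infinite support, so the sum over the preimage cannot be computed by enumeration.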
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13495