Archiving Submission: Yes (archival)
Keywords: tokenization, constrained generation, automata theory
TL;DR: We discuss implementation pitfalls and details of tokenizer-aware, finite-state-transducer-based, subword-level constrained generation.
Abstract: Constrained generation, where language models are forced to output text that adheres to a specified format, is a powerful tool for many tasks. Several libraries implement variants of it as the foundation for a larger feature set. In implementing our own version, we uncovered many subtle problems (some of which are present in existing libraries) that can affect the downstream performance of models that use constrained decoding.
Here, we describe common pitfalls and techniques for implementing robust constrained generation that apply to all major tokenizer families. Furthermore, we describe favorable properties of our character-to-canonical pipeline, such as ease of use, efficiency, and modularity. We hope this work guides you and your tokens to reliable, correct constrained outputs. Our implementation can be found [here](https://github.com/mcognetta/constrainedgenerationpitfalls).
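To make the core idea concrete, here is a minimal sketch (not the authors' implementation) of automaton-based subword-level constrained generation: a hand-written DFA accepts strings matching the toy pattern `a+b`, and at each decoding step we keep only the vocabulary tokens that do not drive the automaton into a dead state. The vocabulary and scoring function are hypothetical stand-ins for a real tokenizer and language model.

```python
# Hypothetical toy subword vocabulary; a real tokenizer would supply this.
VOCAB = ["a", "aa", "ab", "b", "ba", "bb"]

# DFA for the pattern a+b: state 0 = start, 1 = seen one or more a's,
# 2 = accept, -1 = dead.
def step(state, ch):
    if state == 0:
        return 1 if ch == "a" else -1
    if state == 1:
        if ch == "a":
            return 1
        if ch == "b":
            return 2
        return -1
    return -1  # the accept state has no outgoing edges

def advance(state, token):
    """Run the DFA over every character of a multi-character subword token."""
    for ch in token:
        state = step(state, ch)
        if state == -1:
            return -1
    return state

def allowed_tokens(state):
    """The token-level mask: subwords that keep the automaton alive."""
    return [t for t in VOCAB if advance(state, t) != -1]

def constrained_decode(score, max_steps=10):
    """Greedy decoding with a stand-in scoring function instead of an LM."""
    state, out = 0, []
    for _ in range(max_steps):
        if state == 2:  # reached the accept state
            break
        candidates = allowed_tokens(state)
        if not candidates:
            break
        best = max(candidates, key=score)
        out.append(best)
        state = advance(state, best)
    return "".join(out), state == 2

# A stand-in "model" that happens to prefer the token "ab".
text, accepted = constrained_decode(score=lambda t: {"ab": 2.0}.get(t, 1.0))
# text == "ab", accepted == True
```

Note that the mask is computed by simulating the automaton character-by-character through each candidate subword, which is one place where the abstract's "subtle problems" arise: a token can be individually plausible yet unreachable under the constraint once tokenization is taken into account.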
Submission Number: 51