Archiving Submission: Yes (archival)
Keywords: tokenization, constrained generation, automata theory
TL;DR: We discuss implementation pitfalls and details of tokenizer-aware, subword-level constrained generation built on finite-state transducers.
Abstract: Constrained generation, where language models are forced to output text that adheres to a specified format, is a powerful tool for many tasks. Several libraries implement variants of it as the foundation for a larger feature set. In implementing our own version, we uncovered many subtle problems (some of which are present in existing libraries) that can affect the downstream performance of models that use constrained decoding.
Here, we describe the process of implementing robust constrained generation and its common pitfalls, using \textsc{Llama2} as an example; the approach extends to all major tokenizers. Furthermore, we highlight favorable properties of our character-to-canonical pipeline (ease of use, efficiency, modularity, etc.). We hope this work guides you and your tokens to reliably correct constrained outputs.
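To make the setting concrete, below is a minimal, self-contained sketch of subword-level constrained decoding. It is not the paper's pipeline: the toy vocabulary, the scores standing in for model logits, the digits-only constraint, and every function name are illustrative assumptions. It shows the core idea the abstract names: advance a character-level automaton through each candidate subword and mask any token that would leave the automaton without a live state.

```python
# Sketch of subword-level constrained decoding (illustrative assumptions
# throughout). Toy character DFA for the language [0-9]+ :
# state 0 = start, state 1 = accepting (>=1 digit seen), None = dead.

EOS = "<eos>"

def step(state, ch):
    if state is None:
        return None
    return 1 if ch.isdigit() else None

def accepting(state):
    return state == 1

def advance(state, token):
    """Run every character of a subword token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

def constrained_greedy(vocab_scores, max_len=8):
    """Greedy decode; `vocab_scores` stands in for per-step logits.
    Tokens whose characters would kill the DFA are masked out, and
    EOS is only allowed in an accepting state, so any finished
    output is guaranteed to match the format."""
    state, out = 0, []
    for _ in range(max_len):
        allowed = {
            tok: score for tok, score in vocab_scores.items()
            if (tok == EOS and accepting(state))
            or (tok != EOS and advance(state, tok) is not None)
        }
        if not allowed:  # dead end: nothing keeps the constraint alive
            break
        tok = max(allowed, key=allowed.get)
        if tok == EOS:
            break
        state = advance(state, tok)
        out.append(tok)
    return "".join(out)

# "ab" scores highest, but the mask removes it and "x9";
# only digit-composed subwords survive.
scores = {"ab": 2.5, "12": 2.0, "3": 1.0, "x9": 1.5, EOS: 0.5}
print(constrained_greedy(scores, max_len=3))  # -> "121212"
```

A real implementation would operate on token IDs and logit tensors and would precompute per-state token masks rather than re-walking each subword per step, which is where the transducer-based formulation pays off.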
Submission Number: 51