Flexible and Efficient Grammar-Constrained Decoding

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper presents a novel approach to grammar-constrained decoding that significantly reduces preprocessing overhead while preserving state-of-the-art efficiency in online mask computation.
Abstract: Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that would provably lead to outputs outside a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms must compute how a given LLM's subword tokenizer can "align" with the tokens used by a given context-free grammar, and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm, together with an implementation, that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
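To make the masking idea from the abstract concrete, here is a minimal, illustrative sketch, not the paper's algorithm: for a toy balanced-parentheses CFG and an invented multi-character subword vocabulary, it checks which tokens keep the partial output a viable prefix of the grammar's language. All names, the grammar, and the vocabulary are assumptions made up for this example.

```python
# Illustrative sketch of grammar-constrained decoding (GCD): before each
# decoding step, mask out vocabulary tokens that would provably push the
# output outside the grammar. The grammar here is the toy CFG of balanced
# parentheses (S -> "" | "(" S ")" S); the subword vocabulary is invented.
# This is NOT the paper's algorithm or implementation.

def is_viable_prefix(s: str) -> bool:
    """True iff s can still be extended to a balanced-parenthesis string:
    no prefix of s may close more parentheses than it has opened."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return True

def token_mask(output_so_far: str, vocab: list[str]) -> list[bool]:
    """For each subword token, check whether appending it keeps the
    output a viable prefix of the grammar's language."""
    return [is_viable_prefix(output_so_far + tok) for tok in vocab]

# Toy subword vocabulary: a single token may span several grammar symbols,
# which is the tokenizer/grammar "alignment" problem the abstract mentions.
vocab = ["(", ")", "((", "))", "()", ")("]

print(token_mask("(", vocab))   # [True, True, True, False, True, True]
print(token_mask("()", vocab))  # [True, False, True, False, True, False]
```

Checking every vocabulary token against the grammar at each step, as this sketch does, is exactly what becomes expensive at realistic vocabulary and grammar sizes; per the abstract, sound GCD systems instead precompute offline how subword tokens align with grammar tokens, and it is that preprocessing step the paper speeds up by 17.71x.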
Lay Summary: Large language models (LLMs) are often used to produce structured text, such as computer code or data in specific formats. Getting them to always follow strict rules, like the grammar of a programming language, can be tricky. To help with this, researchers use a technique called grammar-constrained decoding (GCD), which ensures that the model only generates text that follows the rules. However, existing methods take a long time, often tens of minutes, to prepare before they can be used. Our work introduces a new algorithm and tool that prepare these grammar rules much more quickly, while keeping the actual text generation just as fast and accurate as before. This improvement makes it more practical to use LLMs in real-world applications where following strict formats is essential, such as writing code or generating structured data for websites and databases.
Primary Area: Applications->Language, Speech and Dialog
Keywords: Language Models, Decoding, Context-free Grammars
Submission Number: 4050