Earley-Driven Dynamic Pruning for Efficient Structured Decoding

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: ZapFormat is a novel dynamic pruning algorithm to accelerate structured generation in LLMs.
Abstract: Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, a requirement that is critical in applications such as function calling and domain-specific language (DSL) generation. Constrained decoding with a context-free grammar is a flexible approach that guarantees an LLM's adherence to a specified format by dynamically building a token logits mask. However, creating this mask requires checking the validity of every token in the LLM vocabulary at every decoding step, which often incurs significant overhead in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real time, substantially reducing the memory footprint of the Earley algorithm's states. This reduction in turn enables a state cache that speeds up structured generation across large numbers of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron, which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed of up to 2x compared to state-of-the-art implementations. Moreover, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.
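To make the masking overhead concrete, below is a minimal PyTorch sketch of the per-step logits-mask loop the abstract describes. All names here (including `is_token_valid`) are hypothetical illustrations, not Formatron's actual API:

```python
import torch

def constrained_decode_step(logits: torch.Tensor, is_token_valid) -> int:
    """One step of grammar-constrained decoding.

    logits: shape (vocab_size,), raw next-token scores from the LLM.
    is_token_valid: hypothetical callable mapping a token id to True if
        appending that token keeps the output parseable by the grammar.
    """
    vocab_size = logits.shape[0]
    # Build the boolean mask by checking every token in the vocabulary.
    # This per-step, whole-vocabulary check is the overhead that
    # constrained decoding engines work to reduce.
    mask = torch.tensor([is_token_valid(t) for t in range(vocab_size)])
    masked = logits.masked_fill(~mask, float("-inf"))
    # Greedy selection for simplicity; sampling works the same way.
    return int(torch.argmax(masked).item())
```

Because the validity check is consulted once per vocabulary entry per decoding step, vocabularies of 100k+ tokens make this loop a dominant cost; that is the overhead ZapFormat's pruning and state caching target.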
Lay Summary: Large language models (LLMs) can draft essays, write code, and answer questions, but they sometimes stray from a required format, and staying in format is essential when integrating LLMs into a larger system. Today’s “constrained decoding” techniques keep LLMs inside the lines by checking every possible next word against a set of grammar rules, yet that safeguard slows things down because the model must re-check its entire vocabulary at every step. We introduce ZapFormat, a method that spots and discards impossible word sequences on the fly, using a classic parsing method called the Earley algorithm. By pruning away dead-end paths early, ZapFormat slashes the bookkeeping the model must do and lets us reuse its previous work through a cache. Built into our new engine Formatron, this approach keeps answers perfectly in format while making constrained decoding for structured output, like valid JSON, code snippets, or database queries, up to twice as fast. The technique works across many different LLM architectures, paving the way for quicker and more reliable AI agents in data-critical fields such as finance, healthcare, and software development.
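For readers curious about the parsing side, the toy sketch below (invented names, not the paper's actual pruning criteria) shows the kind of Earley-state bookkeeping that dynamic pruning trims: items waiting on symbols that can no longer appear are dead ends and can be discarded, and because the pruned chart columns are immutable they can also serve as cache keys, echoing the reuse-through-a-cache idea above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EarleyItem:
    lhs: str      # nonterminal on the left of the grammar rule
    rhs: tuple    # symbols on the right of the rule
    dot: int      # how many rhs symbols have been matched so far
    origin: int   # chart column where this item was created

    def next_symbol(self):
        # The symbol this item is waiting for, or None if completed.
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None

def prune_column(column, viable_symbols):
    """Discard dead-end items: those waiting on a symbol that can no
    longer appear at this position. Completed items (next_symbol() is
    None) are kept so they can still trigger completions. The paper's
    actual criteria are more refined; this only conveys the idea."""
    return frozenset(
        item for item in column
        if item.next_symbol() is None or item.next_symbol() in viable_symbols
    )
```

Returning a frozenset is a deliberate choice in this sketch: hashable columns can key a memoization table, so identical parser states reached by different queries reuse previously pruned results.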
Link To Code: https://github.com/Dan-wanna-M/formatron
Primary Area: Deep Learning->Large Language Models
Keywords: constrained decoding, structured generation, LLM inference, MLSys
Submission Number: 9267