The Hidden Cost of Structured Generation in LLMs: Draft-Conditioned Constrained Decoding

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY 4.0
Abstract: Large language models (LLMs) are increasingly used to generate executable outputs, JSON objects, and API calls, where a single syntax error can make the output unusable. Constrained decoding enforces validity token-by-token via masking and renormalization, but it can distort generation when the model assigns low probability mass to valid continuations, pushing decoding toward locally valid yet semantically incorrect trajectories. We propose Draft-Conditioned Constrained Decoding (DCCD), a simple two-step, training-free inference procedure that decouples semantic planning from structural enforcement: an unconstrained draft is generated first, and constrained decoding is then applied, conditioned on this draft, to guarantee validity. We analyze DCCD through a KL-projection view, showing that draft conditioning increases feasible mass and reduces the cumulative “projection tax” induced by hard constraints. DCCD also supports optional best-of-$K$ draft selection. Across structured reasoning benchmarks, DCCD improves strict structured accuracy by up to +24 percentage points over standard constrained decoding (e.g., 15.2% to 39.0% on GSM8K with a 1B model), and enables smaller model pairs to match or exceed much larger constrained baselines, yielding substantial gains in parameter efficiency.
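The masking-and-renormalization step mentioned in the abstract, and the notion of feasible mass, can be illustrated with a toy single-step sketch. This is not the authors' implementation: the function names (`constrain`, `projection_cost`) and all probabilities are invented for illustration. It relies on a standard identity: the distribution closest in KL divergence to the model's distribution p, among those supported only on the valid set V, is p masked and renormalized, and its divergence from p is -log p(V). The paper's exact definition of the "projection tax" may differ.

```python
import math

def constrain(probs, valid):
    """Mask tokens outside `valid` and renormalize the rest.

    Returns (constrained_probs, feasible_mass), where feasible_mass is the
    probability the unconstrained model assigned to valid tokens.
    """
    feasible_mass = sum(p for tok, p in probs.items() if tok in valid)
    if feasible_mass <= 0.0:
        raise ValueError("no valid token has nonzero probability")
    constrained = {tok: p / feasible_mass
                   for tok, p in probs.items() if tok in valid}
    return constrained, feasible_mass

def projection_cost(feasible_mass):
    """KL(q || p) for the masked-and-renormalized q: equals -log p(V)."""
    return -math.log(feasible_mass)

# Hypothetical next-token distributions at the first output position.
# Only "{" and "[" can start a valid structured output in this toy grammar.
valid = {"{", "["}
no_draft = {"{": 0.04, "[": 0.01, "The": 0.60, "So": 0.35}    # model wants prose
with_draft = {"{": 0.72, "[": 0.18, "The": 0.06, "So": 0.04}  # draft already in context

q_plain, mass_plain = constrain(no_draft, valid)
q_draft, mass_draft = constrain(with_draft, valid)

print(f"feasible mass without draft: {mass_plain:.2f}, "
      f"cost: {projection_cost(mass_plain):.3f} nats")
print(f"feasible mass with draft:    {mass_draft:.2f}, "
      f"cost: {projection_cost(mass_draft):.3f} nats")
```

In this made-up example both constrained distributions are valid by construction, but the step taken without a draft discards most of the model's probability mass, which is the kind of distortion the abstract attributes to standard constrained decoding.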
Lay Summary: Many AI systems are now used to generate outputs that software can execute directly, such as database queries, function calls, and structured data files. These outputs must follow strict formats, as even a minor formatting error can cause a system to fail. Existing methods enforce these rules while the AI is generating its response, but this often disrupts its reasoning process and leads to answers that are correctly formatted but factually wrong. We propose a simple alternative. Instead of forcing the AI to think and format its answer simultaneously, we first allow it to produce an unrestricted draft that captures its reasoning. We then convert that draft into the required format, ensuring the final output complies with all structural rules. This separation allows the model to focus on solving the problem before worrying about formatting. Across a range of mathematical and logical reasoning tasks, our method produces substantially more correct structured outputs than existing approaches. These results suggest that separating “thinking” from “formatting” can make AI systems more reliable when they are used as components of software tools and automated workflows.
Primary Area: Deep Learning->Large Language Models
Keywords: Structured Generation, Inference-Time Decoding, Test-Time Scaling
Submission Number: 6210