Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation

ACL ARR 2025 February Submission 7752 Authors

16 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: Many LLM applications demand efficient structured generation, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), which incurs runtime overhead for context-dependent token processing and is especially inefficient under large inference batches. We therefore propose Pre$^3$, which exploits deterministic pushdown automata (DPDA) to optimize constrained LLM decoding efficiency. First, by **pre**computing **pre**fix-conditioned edges during **pre**processing, Pre$^3$ enables additional edge-level optimizations and supports parallel transition processing. Second, Pre$^3$ introduces an algorithm that transforms LR(1) transition graphs into DPDAs, eliminating runtime path exploration and enabling edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, improving time per output token (TPOT) by up to 40\% and throughput by up to 36\% in our experiments.
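To make the abstract's core idea concrete, below is a minimal, hypothetical sketch of DPDA-constrained decoding: a deterministic transition table keyed by (state, stack top, token) is precomputed offline, so the decode loop only needs table lookups to mask the vocabulary and advance the automaton. The toy grammar, names, and table layout are illustrative assumptions, not the paper's implementation.

```python
# Sketch (not the authors' code): DPDA-constrained decoding with a
# precomputed transition table. Determinism means each
# (state, stack_top, token) triple has at most one move, so no
# runtime path exploration is needed.
# stack_op is ("push", sym) or ("pop", None).
TRANSITIONS = {
    # Toy grammar of balanced braces: S -> "{" S "}" | ""
    (0, "$", "{"): (0, ("push", "{")),
    (0, "{", "{"): (0, ("push", "{")),
    (0, "{", "}"): (0, ("pop", None)),
}

def allowed_tokens(state, stack, vocab):
    """Vocabulary mask: a token is allowed iff a DPDA transition exists."""
    top = stack[-1]
    return [t for t in vocab if (state, top, t) in TRANSITIONS]

def step(state, stack, token):
    """Advance the DPDA by the unique transition for the chosen token."""
    next_state, (op, sym) = TRANSITIONS[(state, stack[-1], token)]
    if op == "push":
        stack.append(sym)
    else:  # "pop"
        stack.pop()
    return next_state

# Decode loop: restrict sampling to allowed_tokens at each step,
# then advance the DPDA with the sampled token.
state, stack, vocab = 0, ["$"], ["{", "}"]
for token in ["{", "{", "}", "}"]:  # stand-in for tokens sampled by the LLM
    assert token in allowed_tokens(state, stack, vocab)
    state = step(state, stack, token)
print("stack after decode:", stack)  # back to ["$"]
```

The property this sketch leans on is the one the abstract claims for DPDAs: because every (state, stack top, token) triple has at most one transition, both mask computation and state advancement reduce to constant-time lookups rather than a search over nondeterministic paths at decode time.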
Paper Type: Long
Research Area: Generation
Research Area Keywords: text-to-text generation, efficient models
Languages Studied: English
Submission Number: 7752