Abstract: A wide range of LLM applications demand efficient structured generation, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON).
Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), which incurs runtime execution overhead for context-dependent token processing and is especially inefficient under large inference batches.
We therefore propose Pre$^3$, which exploits deterministic pushdown automata (DPDAs) to optimize constrained LLM decoding efficiency.
First, by **pre**computing **pre**fix-conditioned edges during a **pre**processing stage, Pre$^3$ enables additional edge-level optimizations ahead of time and supports parallel transition processing.
Second, Pre$^3$ introduces an algorithm that transforms LR(1) transition graphs into DPDAs, eliminating the need for runtime path exploration and enabling edge transitions with minimal overhead.
Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, improving time per output token (TPOT) by up to 40\% and throughput by up to 36\% in our experiments.
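For intuition, here is a minimal Python sketch of the core idea, not the paper's implementation: a toy JSON-like grammar whose transitions have been flattened into a deterministic table, so that each (state, stack top, token) triple maps to exactly one precomputed edge. The names (`DPDA_EDGES`, `dpda_step`, `valid_tokens`) and the toy grammar are illustrative assumptions; the point is that both a single transition and the per-position token mask reduce to table lookups with no runtime path exploration.

```python
from typing import Optional

# Hypothetical DPDA transition table, built offline during preprocessing.
# Key: (state, stack_top, token) -> (next_state, stack_action).
# Determinism: each key has exactly one edge, so a step is one dict lookup.
DPDA_EDGES: dict = {
    (0, "$", "{"): (1, ("push", "OBJ")),
    (1, "OBJ", '"key"'): (2, ("noop", None)),
    (2, "OBJ", ":"): (3, ("noop", None)),
    (3, "OBJ", '"value"'): (4, ("noop", None)),
    (4, "OBJ", "}"): (5, ("pop", None)),
}

def dpda_step(state: int, stack: list, token: str) -> Optional[int]:
    """One DPDA transition: a single table lookup, no path search."""
    edge = DPDA_EDGES.get((state, stack[-1], token))
    if edge is None:
        return None  # token is not permitted by the grammar here
    next_state, (op, symbol) = edge
    if op == "push":
        stack.append(symbol)
    elif op == "pop":
        stack.pop()
    return next_state

def valid_tokens(state: int, stack: list, vocab: list) -> list:
    """Token mask for constrained decoding: a token is valid iff a
    precomputed edge exists for the current (state, stack_top)."""
    return [t for t in vocab if (state, stack[-1], t) in DPDA_EDGES]

# Walk the toy automaton over a tiny JSON-like token stream.
state, stack = 0, ["$"]
for tok in ["{", '"key"', ":", '"value"', "}"]:
    assert tok in valid_tokens(state, stack, [tok, "<bad>"])
    state = dpda_step(state, stack, tok)
print(state, stack)  # -> 5 ['$']
```

Because each lookup is independent of the others, masks for many sequences in a batch can be computed in parallel, which is where the abstract's batched-inference claim comes from.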
Paper Type: Long
Research Area: Generation
Research Area Keywords: text-to-text generation,efficient models
Languages Studied: English
Submission Number: 7752