Abstract: This paper presents Predictive Pipelined Decoding (PPD), an approach that speeds up decoding in Large Language Models (LLMs) while producing exactly the same output as the original decoding. Unlike conventional strategies, PPD employs additional compute resources to begin decoding subsequent tokens in parallel while the current token is still being decoded. This method reduces decoding latency and reframes the trade-offs inherent in LLM decoding strategies. We develop a theoretical framework to analyze the trade-off between computation and latency. Using this framework, we analytically estimate the latency reduction our method can achieve by estimating the match rate, denoted $p_\text{correct}$. The results demonstrate that extra computational resources can accelerate LLM decoding. In addition, we implement PPD and conduct preliminary experiments to empirically validate its efficacy, accounting for practical overheads not captured by the theoretical analysis.
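The abstract's computation-vs-latency trade-off can be illustrated with a minimal latency model. The sketch below is an assumption-laden toy, not the paper's framework: it supposes each decoding step costs one time unit sequentially, that PPD overlaps a fraction `overlap` of the next step with the current one, and that the early prediction matches the final token with probability `p_correct`.

```python
# Toy latency model for Predictive Pipelined Decoding (PPD).
# All parameter names and values here are illustrative assumptions,
# not quantities taken from the paper.

def expected_speedup(p_correct: float, overlap: float) -> float:
    """Expected per-token speedup under a simple overlap model.

    Sequential decoding costs 1 unit per token. With PPD, a correct
    early prediction (probability p_correct) hides a fraction
    `overlap` of the next step's latency, so the expected per-token
    cost drops to 1 - p_correct * overlap.
    """
    per_token_cost = 1.0 - p_correct * overlap
    return 1.0 / per_token_cost

# Example: a 70% match rate with 30% of each step overlapped
# yields roughly a 1.27x speedup under these assumptions.
print(round(expected_speedup(0.7, 0.3), 3))
```

As the model suggests, the benefit grows with both the match rate $p_\text{correct}$ and the fraction of work that can be pipelined, which is why estimating $p_\text{correct}$ is central to the paper's analysis.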
Submission Length: Regular submission (no more than 12 pages of main content)
Video: https://drive.google.com/file/d/1QHLv7n0iEdqyeyJz4Mh57WEH8cThZiaW/view?usp=drive_link
Supplementary Material: zip
Assigned Action Editor: ~Brian_Kingsbury1
Submission Number: 1709