FOLD: Fast Correct Speculative Decoding

ICLR 2026 Conference Submission16674 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: speculative decoding, early exit, inference acceleration, large language models
TL;DR: We introduce FOLD (Fast cOrrect specuLative Decoding) to accelerate the inference of Large Language Models (LLMs) by fast-correcting wrong tokens.
Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by using a small, fast 'draft' model to propose tokens that a larger 'target' model then verifies in a single, parallel step. While this paradigm has become the standard for high-throughput inference, the community's focus has been almost entirely on a single metric: maximizing the acceptance rate of drafted tokens. We argue this is a critical oversight. The true bottleneck is not just acceptance, but the catastrophic computational cost of rejection. A single rejected token triggers a cascading failure, discarding all subsequent work and nullifying potential gains. We introduce Fast cOrrect specuLative Decoding (FOLD), a framework that fundamentally reframes the problem from merely avoiding rejection to instantly recovering from it. FOLD transforms the verification step itself. Instead of a simple pass/fail check, our novel verifier uses an integrated Early Exit module to proactively generate high-probability alternative sequences in parallel. When the primary draft fails, FOLD doesn't discard the computation; it seamlessly pivots to a pre-computed, correct path. This turns a catastrophic failure into a minor course correction, salvaging the entire speculative branch. Extensive experiments show that by treating rejection as an opportunity for correction, not a point of failure, FOLD achieves up to a 4.09$\times$ speedup over autoregressive decoding, setting a new bar for inference efficiency. We anonymously open-source our project at https://anonymous.4open.science/r/iclr26-fold.
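The draft-verify-recover loop described in the abstract can be sketched as follows. This is a minimal greedy toy, not the authors' implementation: `draft_next`, `target_next`, and `early_exit_alternative` are hypothetical stand-ins for neural models, and a real system would score all drafted positions in one parallel forward pass of the target model.

```python
def speculative_step(prefix, draft_next, target_next, early_exit_alternative, k=4):
    """Propose k draft tokens, verify them against the target, and on
    rejection pivot to a pre-computed alternative token (FOLD-style
    recovery) instead of ending the step with a bare rejection."""
    # 1) Draft phase: the small model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Verify phase: the target checks each drafted token in order.
    accepted = []
    for t in draft:
        if target_next(prefix + accepted) == t:
            accepted.append(t)
        else:
            # 3) Recovery: rather than discarding the step at the first
            #    mismatch, append a high-probability alternative that an
            #    early-exit head pre-computed for this position.
            accepted.append(early_exit_alternative(prefix + accepted))
            break
    return accepted


# Toy deterministic "models" over integer tokens, for illustration only:
# the draft counts up, the target agrees until the last token reaches 5.
def draft_next(ctx):
    return ctx[-1] + 1

def target_next(ctx):
    return ctx[-1] + 1 if ctx[-1] < 5 else 0

# With prefix [3], the draft proposes [4, 5, 6, 7]; the target accepts
# 4 and 5, rejects 6, and the recovery token 0 is appended in its place.
print(speculative_step([3], draft_next, target_next, target_next))
```

The point of the sketch is the `else` branch: a plain speculative decoder would return only the accepted prefix after a rejection, whereas here the step always ends on a token the target endorses, so no verification work is wasted.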
Primary Area: generative models
Submission Number: 16674