Keywords: Lossless Compression, KV cache, LLM, Speculative Decoding
Abstract: Lossy KV cache compression is a well-explored subfield of machine learning efficiency, with improved latency being one of its major gains. However, lossy compression techniques can fumble from time to time, exhibiting varied, often catastrophic failure patterns that are not only difficult to resolve but sometimes hard even to identify in the first place, making the direct deployment of models with a compressed KV cache a risky endeavor. In this work, we explore a way to preserve lossless generation quality while still benefiting from the acceleration of attending only to a compressed KV cache. Specifically, we draw inspiration from the n-gram candidate pool decoding paradigm pioneered by Lookahead Decoding, a largely overlooked and underdeveloped route to efficient yet lossless decoding: we purposely let the model Fumble Around with the compressed KV cache to generate multiple lossy "n-gram guesses" in a single forward pass, and Find Out via lossless verification within that same forward pass, in a truly parallel fashion. Conceptually, our proposed framework is compatible with all typical static and dynamic KV cache compression methods from the token-dropping realm, thus opening a new avenue for the stagnant n-gram decoding paradigm. Practically, we show that, with careful system support, this framework offers useful properties that similar draftless baselines (e.g., Self-Speculative Decoding) simply cannot achieve, such as requiring only a single set of KV cache and being far less sensitive to the choice of model, task, and input length. Our comprehensive empirical results show that FAFO provides a 1.20-2.71x latency speedup over the original model while consistently outperforming other lossless and draftless solutions by a large margin.
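For intuition, the following is a minimal, purely illustrative sketch of the "draft with the compressed cache, verify losslessly" loop the abstract describes; every name here (model_logits, ngram_pool, fafo_decode_step, etc.) is a hypothetical placeholder rather than the paper's actual API, and the two branches are written as separate calls even though the framework fuses them into one forward pass.

```python
# Illustrative sketch only: function and object names are assumptions,
# not the paper's implementation.

def fafo_decode_step(model_logits, full_kv, compressed_kv, prefix, ngram_pool, n=4):
    """One FAFO-style decode step: verify a previously drafted n-gram guess
    against the full (lossless) KV cache while the compressed (lossy) cache
    drafts new guesses. The paper performs both in a single forward pass;
    they appear as two calls here purely for readability."""
    candidate = ngram_pool.best_match(prefix)   # lossy n-gram guess from earlier steps
    tokens = prefix + candidate

    verify_logits = model_logits(tokens, kv_cache=full_kv)        # lossless branch
    draft_logits = model_logits(tokens, kv_cache=compressed_kv)   # lossy drafting branch

    # Lossless verification: keep the longest prefix of the candidate that matches
    # greedy decoding under the full cache, so the output equals ordinary decoding.
    accepted = []
    for i, tok in enumerate(candidate):
        if verify_logits[len(prefix) - 1 + i].argmax() == tok:
            accepted.append(tok)
        else:
            break

    # Refresh the candidate pool with new guesses drafted from the compressed cache.
    new_guess = [draft_logits[j].argmax() for j in range(len(tokens) - n, len(tokens))]
    ngram_pool.add(new_guess)

    return prefix + accepted
```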
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13383