Attention-Level Speculation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We show that attention-level speculation reduces LLM decode latency when tensor parallelism fails to scale.
Abstract: As Large Language Models (LLMs) grow in size and context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, conventional tensor and data parallelism face diminishing returns when scaling across multiple devices. We propose a novel form of parallelism, attention-level speculative parallelism (ALSpec), which predicts self-attention outputs so that subsequent operations can execute early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5x and improving end-to-end decode latency by up to 1.65x, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks, but they fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent's NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer models.
Lay Summary: When a large language model responds to your prompt and generates text, one of the slowest yet most important stages of the computation is called attention. One way to speed up attention is to use more computer chips, but once you use 8 or more chips, the communication between them overwhelms the benefit of the extra computational power. Another way to speed up attention is to approximate its underlying math. Particular attention approximations from prior work perform very well in some cases, but in others they degrade the quality of the generated text. Our paper proposes to sometimes use the approximation and sometimes not, verifying on the fly whether the approximation was good enough. Our experiments suggest that our approach (attention-level speculation) on 8 computer chips can be up to 1.65x faster than the conventional way of using 8 chips to accelerate large language models.
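The sketch below illustrates the core idea described above: a cheap attention approximation feeds the downstream computation speculatively while exact attention runs in parallel, and the result is kept only if the speculation verifies. This is a minimal, single-device Python sketch, not the ALSpec implementation (see the repository linked below); the function names (`approx_attention`, `exact_attention`, `mlp`) and the tolerance-based acceptance check are illustrative assumptions.

```python
# Minimal sketch of attention-level speculation (hypothetical names, not the
# authors' ALSpec code). Assumes callables for an attention approximation,
# exact attention, and the downstream MLP; in the real system the approximate
# and exact paths run on separate devices so their work overlaps.
import torch

def speculative_attention_block(x, approx_attention, exact_attention, mlp,
                                tol=1e-2):
    """One transformer-block step with speculation on the attention output."""
    # 1. Cheap approximation of self-attention (fast path).
    attn_guess = approx_attention(x)

    # 2. Speculatively run the downstream computation on the guess;
    #    conceptually this overlaps with step 3 on another device.
    speculative_out = mlp(attn_guess)

    # 3. Exact self-attention (slow path).
    attn_exact = exact_attention(x)

    # 4. Verify the speculation; fall back to recomputation if it was poor.
    #    The allclose/tolerance criterion here is an illustrative assumption.
    if torch.allclose(attn_guess, attn_exact, atol=tol):
        return speculative_out      # speculation accepted
    return mlp(attn_exact)          # speculation rejected: redo downstream work
```

Residual connections and normalization are omitted for brevity; the point is only the overlap-then-verify structure.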
Link To Code: https://github.com/mcj-group/alspec
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Inference, Speculation, Attention, Transformer, Large Language Model
Submission Number: 14085