TL;DR: Gumiho speeds up LLM text generation with a hybrid approach: it prioritizes accuracy for the crucial early tokens in each draft sequence and uses faster parallel processing for the later ones, improving overall inference speed.
Abstract: Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles one token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single generation paradigm, either purely serial or purely parallel. In contrast, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we use multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. Experimental results show that our method outperforms existing approaches, validating its effectiveness. Our code is available at https://github.com/AMD-AIG-AIMA/Gumiho.
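As a concrete illustration of why early draft tokens matter, below is a minimal sketch of greedy verification: the target model accepts the longest matching prefix of the draft sequence, so a single early mismatch discards every later token. Function and variable names are illustrative, not taken from the Gumiho codebase.

```python
# A minimal sketch of greedy speculative-decoding verification, assuming
# `draft_tokens` is the head-proposed sequence and `target_tokens` is what
# the target LLM would have produced at each position.

def accepted_prefix(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Accept draft tokens up to the first mismatch with the target model.

    A single wrong token at position i discards every draft token after it,
    which is why accuracy on early positions dominates the expected speedup.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    return accepted

# Example: a mismatch at position 1 wastes the three correct later tokens.
print(accepted_prefix([5, 9, 7, 3, 8], [5, 2, 7, 3, 8]))  # -> [5]
```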
Lay Summary: Large Language Models (LLMs) generate text token by token, which can be slow. Speculative decoding (SPD) speeds this up by using a smaller "draft" model to predict multiple future tokens that the main LLM then verifies. This paper argues that in SPD, the initial tokens in a predicted sequence are more critical than later ones because an early error discards the entire subsequent sequence.
To address this, the researchers propose Gumiho, a hybrid SPD architecture. Gumiho uses a more sophisticated Transformer-based structure in a serial manner for the crucial early tokens to improve their accuracy. For later, less critical tokens, it employs multiple lightweight MLP heads that operate in parallel for better efficiency. This targeted allocation of resources aims to boost overall performance.
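To make the described architecture concrete, here is a minimal PyTorch sketch of a hybrid draft module in the spirit of Gumiho: a heavier Transformer head applied serially for the early positions, and lightweight MLP heads evaluated in parallel for the later ones. All dimensions, head counts, and layer choices are assumptions for illustration and do not reflect the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HybridDraftHeads(nn.Module):
    """Sketch of a hybrid draft model: one Transformer head run serially for
    the early (critical) draft tokens, plus lightweight MLP heads evaluated
    in parallel for the later ones. Hyperparameters are illustrative only.
    """

    def __init__(self, hidden: int, vocab: int, n_serial: int = 2, n_parallel: int = 5):
        super().__init__()
        self.n_serial = n_serial
        # Heavier serial head: a single Transformer block plus an LM head.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.serial_block = nn.TransformerEncoder(layer, num_layers=1)
        self.serial_lm_head = nn.Linear(hidden, vocab)
        # Lightweight parallel heads: one small MLP per later draft position.
        self.parallel_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, vocab))
            for _ in range(n_parallel)
        )

    def forward(self, h: torch.Tensor) -> list[torch.Tensor]:
        """h: [batch, seq, hidden] hidden states from the target LLM."""
        logits = []
        state = h
        # Early tokens: run the Transformer head serially, one step per token.
        # (A real draft loop would also append each newly drafted token's
        # embedding to the state before the next step.)
        for _ in range(self.n_serial):
            state = self.serial_block(state)
            logits.append(self.serial_lm_head(state[:, -1]))
        # Later tokens: all MLP heads read the same state in one parallel pass.
        logits.extend(head(state[:, -1]) for head in self.parallel_heads)
        return logits

# Example: draft 2 + 5 = 7 candidate-token logit sets from dummy hidden states.
heads = HybridDraftHeads(hidden=256, vocab=32000)
out = heads(torch.randn(1, 4, 256))
print(len(out), out[0].shape)  # 7 torch.Size([1, 32000])
```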
Link To Code: https://github.com/AMD-AIG-AIMA/Gumiho
Primary Area: Deep Learning->Large Language Models
Keywords: Speculative Decoding, Large Language Models
Submission Number: 1485