Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-tuned LLMs

Authors: ICLR 2024 Workshop ME-FoMo Submission 37 Authors

Published: 04 Mar 2024, Last Modified: 03 May 2024 | ME-FoMo 2024 Poster | CC BY 4.0
Keywords: knowledge distillation, new loss function, speculative decoding, efficient inference, large language models
TL;DR: We propose a simple training framework for aligning a small draft model with a large language model using a new distillation loss built on total-variation distance, achieving a 2.4x speed-up over auto-regressive decoding.
Abstract: Text generation with Large Language Models (LLMs) is known to be memory-bound due to the combination of their auto-regressive nature, huge parameter counts, and limited memory bandwidth, often resulting in low token rates. Speculative decoding has been proposed as a solution for LLM inference acceleration. However, since draft models are often unavailable in modern open-source LLM families, e.g., for Llama 2 7B, training a high-quality draft model is required to enable inference acceleration via speculative decoding. In this paper, we propose a simple draft model training framework for direct alignment to chat-capable target models. With the proposed framework, we train Llama 2 Chat Drafter 115M, a draft model for Llama 2 Chat 7B or larger, at only 1.64\% of the original size. Our training framework consists only of pretraining, distillation dataset generation, and finetuning with knowledge distillation, with no additional alignment procedure. For the finetuning step, we use instruction-response pairs generated by the target model for distillation in a plausible data distribution, and propose a new Total Variation Distance++ (TVD++) loss that incorporates variance-reduction techniques inspired by policy gradient methods in reinforcement learning. Our empirical results show that Llama 2 Chat Drafter 115M with speculative decoding achieves a block efficiency of up to 2.3 and a 2.4$\times$ speed-up relative to auto-regressive decoding on various tasks with no further task-specific fine-tuning.
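As a rough illustration of the distillation objective described in the abstract, the sketch below computes a plain total-variation-distance loss between draft-model and target-model token distributions in PyTorch. This is not the authors' code: the function name, tensor shapes, and vocabulary size are illustrative assumptions, and the paper's TVD++ variance-reduction terms (inspired by policy gradient methods) are not reproduced here.

```python
# Minimal sketch (assumed, not from the paper) of a total-variation-distance (TVD)
# distillation loss between a draft (student) model and a target (teacher) model.
import torch
import torch.nn.functional as F

def tvd_distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor) -> torch.Tensor:
    """TVD(p, q) = 0.5 * sum_v |p_v - q_v|, averaged over all token positions.

    Args:
        student_logits: (batch, seq_len, vocab) logits from the draft model.
        teacher_logits: (batch, seq_len, vocab) logits from the target model.
    """
    p = F.softmax(teacher_logits, dim=-1)  # target-model token distribution
    q = F.softmax(student_logits, dim=-1)  # draft-model token distribution
    tvd = 0.5 * (p - q).abs().sum(dim=-1)  # per-position total variation
    return tvd.mean()                      # average over batch and sequence

# Example usage with random logits standing in for real model outputs.
if __name__ == "__main__":
    student = torch.randn(2, 16, 32000, requires_grad=True)
    teacher = torch.randn(2, 16, 32000)
    loss = tvd_distillation_loss(student, teacher)
    loss.backward()
    print(f"TVD distillation loss: {loss.item():.4f}")
```

In practice, this loss would be applied during the finetuning-with-knowledge-distillation step, with teacher logits taken from the frozen target model on the generated instruction-response pairs; the TVD++ modifications described in the paper would replace the plain TVD term above.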
Submission Number: 37