SD$^2$: Self-Distilled Sparse Drafters

Mike Lasby; Nish Sinnadurai; Valavan Manohararajah; Sean Lie; Yani Ioannou; Vithursan Thangarasa

Published: 11 Jun 2025, Last Modified: 10 Jul 2025. Venue: ES-FoMo III. License: CC BY 4.0

Keywords: llm, speculative decoding, sparsity, 2:4, pruning, compression, quantization, distillation, synthetic data, supervised fine-tuning

TL;DR: We investigate the use of self-data distillation and fine-grained weight sparsity to create highly efficient draft models for accelerating speculative decoding.

Abstract: Speculative decoding is a powerful technique for reducing the latency of Large Language Models (LLMs), offering a fault-tolerant framework that enables the use of highly compressed draft models. In this work, we introduce Self-Distilled Sparse Drafters (SD$^2$), a novel methodology that leverages self-data distillation and fine-grained weight sparsity to produce highly efficient and well-aligned draft models. SD$^2$ systematically enhances draft token acceptance rates while significantly reducing Multiply-Accumulate operations (MACs), even in the Universal Assisted Generation (UAG) setting, where draft and target models originate from different model families. On a Llama-3.1-70B target model, SD$^2$ provides a 1.59× higher Mean Accepted Length (MAL) compared to layer-pruned draft models, and reduces MACs by over 43.87% at the cost of an 8.36% reduction in MAL compared to dense draft models. Our 1.5B and 3B unstructured sparse drafters outperform both dense and layer-pruned models of equivalent size in terms of end-to-end latency improvements, highlighting the potential of sparsity-aware fine-tuning and compression strategies to improve LLM inference efficiency while maintaining alignment with target models.
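
To make the Mean Accepted Length (MAL) metric mentioned in the abstract concrete, the following is a minimal sketch and not the authors' evaluation code. It assumes greedy verification, in which a drafted token is accepted only if it matches the target model's token at the same position, and it does not count any bonus token the target may emit after a round. The function names and the token IDs are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def accepted_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Length of the longest prefix on which draft and target tokens agree.

    Assumption: greedy verification, so acceptance stops at the first mismatch.
    """
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def mean_accepted_length(rounds: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average accepted length over a list of (draft, target) verification rounds."""
    if not rounds:
        raise ValueError("at least one verification round is required")
    return sum(accepted_length(d, t) for d, t in rounds) / len(rounds)


# Hypothetical token IDs for three draft-and-verify rounds.
rounds = [
    ([5, 7, 9, 2], [5, 7, 9, 2]),  # all four drafted tokens accepted
    ([4, 4, 1, 0], [4, 4, 8, 0]),  # mismatch at position 2, so two accepted
    ([3, 6, 6, 6], [1, 6, 6, 6]),  # mismatch at position 0, so none accepted
]
mal = mean_accepted_length(rounds)
print(mal)
```

Under these assumptions, a better-aligned drafter yields longer matching prefixes per round, which is why a higher MAL translates into fewer target-model forward passes per generated token.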

Submission Number: 78
