The Disparate Impacts of Speculative Decoding

18 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: Speculative Decoding, Fairness, Multilingual LLMs, Knowledge Distillation
TL;DR: We highlight speed-up disparities in speculative decoding, outline why they emerge, and show how to mitigate them.
Abstract: Speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper analyzes speculative decoding through the lens of its potentially disparate speed-up rates across tasks. Crucially, it shows that the speed-up gained from speculative decoding is not uniformly distributed across tasks and consistently diminishes for under-fit, and often underrepresented, tasks. To better understand this phenomenon, we derive an analysis that quantifies the observed "unfairness" and identifies the factors that give rise to such disparate speed-ups. Guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, showing up to a 76.7% improvement in our fairness metric.
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12433
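
The abstract's claim that speed-up varies across tasks can be illustrated with a minimal sketch. It does not reproduce the paper's analysis or its fairness metric. It uses the standard simplified model of speculative decoding, in which each drafted token is accepted independently with probability alpha and the drafter proposes gamma tokens per step. The function name, the task labels, and the acceptance rates are hypothetical and chosen only for illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption (not from the paper): every drafted token is accepted
    independently with probability `alpha`, and the drafter proposes
    `gamma` tokens per step. Under that model the expectation is the
    geometric sum 1 + alpha + ... + alpha**gamma.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one token from the target model.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates: a task the drafter fits well
# versus a task it fits poorly.
task_alpha = {"well_fit_task": 0.8, "under_fit_task": 0.5}
gamma = 4

for task, alpha in task_alpha.items():
    tokens = expected_tokens_per_step(alpha, gamma)
    print(f"{task}: {tokens:.2f} expected tokens per verification pass")
```

With these illustrative numbers, the well-fit task yields about 3.36 tokens per verification pass and the under-fit task about 1.94, so the same drafter gives noticeably less benefit on the task it fits poorly. This counts tokens per pass only; wall-clock speed-up also depends on the drafter's cost relative to the target model.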