Abstract: Speculative decoding (SD) is a promising method for accelerating Large Language Model (LLM) decoding. The speedup efficiency of SD mainly depends on the consistency between the draft model and the verify model. However, previous drafting methods usually require training extra modules, which are costly to obtain and hard to keep consistent across different LLMs. In this paper, we introduce CLaSp, an in-context layer-skipping strategy for self-speculative decoding. It requires neither additional drafting modules nor additional training. Instead, it employs a plug-and-play method that skips intermediate layers of the verify model to form a compressed draft model. Specifically, we design a dynamic programming algorithm to select which layers to skip for the current drafting step, using the full hidden states from the last verify stage as the optimization objective. CLaSp can therefore dynamically adjust its layer-skipping strategy based on context after each verify stage, without pre-optimizing a fixed set of skipped layers on large amounts of training data. Experimental results across various downstream tasks show that CLaSp achieves a 1.3× ∼ 1.7× speedup on LLaMA3-series models without altering the original distribution of the generated text.
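To make the abstract's core mechanism concrete, the following is a minimal NumPy sketch of a dynamic-programming layer selection of the kind described: given the full hidden states saved from the last verify stage, choose which layers to skip so that the compressed (draft) forward pass stays close to the full one. The toy residual layers, hidden size, skip budget, and cosine-similarity objective are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of in-context layer-skip selection via dynamic programming.
# Assumptions (not from the paper): toy residual "layers", cosine similarity as
# the closeness measure, and a fixed budget of layers to skip.
import numpy as np

rng = np.random.default_rng(0)
D, L, SKIP_BUDGET = 16, 8, 3  # hidden size, number of layers, max layers to skip

# Toy transformer-like residual layers with random weights.
weights = [rng.normal(scale=0.1, size=(D, D)) for _ in range(L)]

def layer(h, l):
    return h + np.tanh(h @ weights[l])

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Full hidden states from the last verify stage (the optimization targets).
x = rng.normal(size=D)
full_states = [x]
for l in range(L):
    full_states.append(layer(full_states[-1], l))

# DP state: dp[s] = (best draft hidden state, skipped-layer set) after the
# current layer with exactly s layers skipped, ranked by similarity to the
# full model's hidden state at the same depth.
dp = {0: (x, [])}
for l in range(L):
    target = full_states[l + 1]
    new_dp = {}
    for s, (h, skipped) in dp.items():
        # Option 1: keep layer l.
        kept = layer(h, l)
        if s not in new_dp or cos(kept, target) > cos(new_dp[s][0], target):
            new_dp[s] = (kept, skipped)
        # Option 2: skip layer l (pass the hidden state through unchanged).
        if s + 1 <= SKIP_BUDGET and (
            s + 1 not in new_dp or cos(h, target) > cos(new_dp[s + 1][0], target)
        ):
            new_dp[s + 1] = (h, skipped + [l])
    dp = new_dp

# Pick the skip set whose final draft state best matches the full final state.
best_s = max(dp, key=lambda s: cos(dp[s][0], full_states[-1]))
h_draft, skip_set = dp[best_s]
print(f"skip layers {skip_set}; "
      f"cosine to full final hidden state: {cos(h_draft, full_states[-1]):.4f}")
```

Because the targets come from the hidden states of the most recent verify stage, re-running this selection after every verification lets the skip set adapt to the current context rather than being fixed offline.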
Paper Type: Long
Research Area: Generation
Research Area Keywords: Speculative Decoding, Layer-wise Sparsity
Contribution Types: NLP engineering experiment, Approaches for low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2364