CLaSp: In-Context Layer Skip for Self-Speculative Decoding

ACL ARR 2024 December Submission 2364 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Speculative decoding (SD) is a promising method for accelerating Large Language Model (LLM) decoding. The speedup of SD mainly depends on the consistency between the draft model and the verify model. However, previous drafting methods usually require training extra modules, which are challenging to obtain and to keep consistent across different LLMs. In this paper, we introduce CLaSp, an in-context layer-skipping strategy for self-speculative decoding. It requires neither additional drafting modules nor extra training. Instead, it employs a plug-and-play approach that skips intermediate layers of the verify model to form a compressed draft model. Specifically, we design a dynamic programming algorithm that selects which layers to skip for the current drafting step, using the full hidden states from the last verification stage as the optimization objective. CLaSp can therefore dynamically adjust its layer-skipping strategy based on context after each verification stage, without pre-optimizing a fixed set of skipped layers on large amounts of training data. Experimental results across various downstream tasks indicate that CLaSp achieves a 1.3× ∼ 1.7× speedup on LLaMA3-series models without altering the original distribution of the generated text.
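The abstract describes a dynamic programming procedure that picks layers to skip so the drafted hidden states stay close to the full model's hidden states from the last verification pass. The sketch below is a minimal toy illustration of that idea, not the authors' implementation: it assumes simple residual linear layers, uses cosine similarity to the full-model hidden states as the objective, and the function name `select_skipped_layers` is hypothetical.

```python
# Toy sketch of in-context layer selection for self-speculative drafting.
# Assumptions (not from the paper's code): layers are residual linear maps,
# and cosine similarity to the full model's hidden states is the score.
import numpy as np

rng = np.random.default_rng(0)
D, L = 64, 16                                              # hidden size, layer count
layers = [rng.normal(0, 0.05, (D, D)) for _ in range(L)]   # toy layer weights

def apply_layer(h, W):
    """One toy 'transformer layer': residual connection plus a linear map."""
    return h + h @ W

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def select_skipped_layers(h0, full_hidden, num_skip):
    """Hypothetical DP: choose `num_skip` layers to skip so that the drafted
    hidden state tracks the full model's per-layer hidden states.
    dp[s] holds (score, hidden state, skipped set) with exactly s skips so far."""
    dp = {0: (1.0, h0, frozenset())}
    for l in range(L):
        new_dp = {}
        for s, (_, h, skipped) in dp.items():
            # Option 1: keep layer l and apply it.
            h_keep = apply_layer(h, layers[l])
            cand = (cosine(h_keep, full_hidden[l]), h_keep, skipped)
            if s not in new_dp or cand[0] > new_dp[s][0]:
                new_dp[s] = cand
            # Option 2: skip layer l (hidden state passes through unchanged).
            if s + 1 <= num_skip:
                cand = (cosine(h, full_hidden[l]), h, skipped | {l})
                if s + 1 not in new_dp or cand[0] > new_dp[s + 1][0]:
                    new_dp[s + 1] = cand
        dp = new_dp
    return sorted(dp[num_skip][2])

# A full forward pass supplies the per-layer reference hidden states
# (in CLaSp these are obtained from the last verification stage).
h = rng.normal(size=D)
full_hidden, x = [], h.copy()
for W in layers:
    x = apply_layer(x, W)
    full_hidden.append(x.copy())

print("layers to skip:", select_skipped_layers(h, full_hidden, num_skip=6))
```

In this simplified version, each DP cell keeps only the best-scoring hidden state for a given skip budget; rerunning the selection after every verification stage is what makes the skipping strategy context-dependent rather than fixed in advance.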
Paper Type: Long
Research Area: Generation
Research Area Keywords: Speculative Decoding, Layer-wise Sparsity
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2364
