Keywords: LLMs, Chain-of-Thought, Speculative Decoding, Process-Level Distillation, Mixed Framework, SurpriseRatio
TL;DR: We propose a CoT-aware mixed training framework that improves speculative decoding efficiency via process-level distillation and SR-guided data selection
Abstract: Chain-of-thought (CoT) prompting enhances the reasoning of large language models (LLMs) but increases autoregressive latency; speculative decoding (SD) mitigates this via a small-drafter--large-verifier pipeline whose efficiency hinges on drafted-token acceptance. We show that training-based SD methods (e.g., EAGLE) suffer from catastrophic forgetting and distribution shift under naïve CoT supervision, and we propose a CoT-aware mixed training framework that raises acceptance without altering decoding hyperparameters by combining (i) process-level CoT distillation with feature regression to reduce forward KL divergence and improve step-wise acceptance, and (ii) SurpriseRatio (SR), a data-selection metric that anchors the distribution and prevents forgetting using minimal open-domain samples. A two-stage mixed-training schedule further balances task alignment and generalization. Experiments on two target models show that our methods achieve wall-clock speedups of $3.04\times$–$4.55\times$ across three datasets, while increasing the average acceptance length by $2.76\times$–$5.62\times$.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24083