Keywords: LLMs, Chain-of-Thought, Speculative Decoding, Process-Level Distillation, Mixed Framework, SurpriseRatio
TL;DR: We propose a CoT-aware mixed training framework that improves speculative decoding efficiency via process-level distillation and SR-guided data selection
Abstract: Chain-of-thought (CoT) prompting enhances the reasoning of large language models (LLMs) but increases autoregressive latency; speculative decoding (SD) mitigates this via a small-drafter--large-verifier pipeline whose efficiency hinges on drafted-token acceptance. We show that training-based SD methods (e.g., EAGLE) suffer from catastrophic forgetting and distribution shift under naïve CoT supervision, and we propose a CoT-aware mixed training framework that raises acceptance without altering decoding hyperparameters by combining (i) process-level CoT distillation with feature regression to reduce forward KL divergence and improve step-wise acceptance, and (ii) SurpriseRatio (SR), a data-selection metric that anchors the distribution and prevents forgetting using minimal open-domain samples. A two-stage mixed-training schedule further balances task alignment and generalization. Experiments on two target models show that our methods achieve wall-clock speedups of $3.04\times$–$4.55\times$ across three datasets, while increasing the average acceptance length by $2.76\times$–$5.62\times$.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24083