Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enables Smaller Language Models to Reason Better and Faster
Abstract: Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, which causes two issues: 1) Long rationales lead to a large token-level batch size during training, which over-smooths the gradients of core reasoning tokens (i.e., tokens that directly affect the correctness of subsequent reasoning), since they make up only a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The answer is slow to produce, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search based on the SLM loss to divide the rationale into semantically coherent chunks and focuses the SLM on learning one chunk per iteration. Because each chunk contains fewer tokens, the gradients of the core reasoning tokens within it receive greater weight during backpropagation. Building on CWT, we propose skip-thinking training (STT), which lets the SLM skip several intermediate reasoning chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Large Language Model, Chain of Thought, Knowledge Distillation
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 3278
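The abstract above describes how chunk-wise training (CWT) splits a rationale and back-propagates the loss of one chunk per iteration. Below is a minimal, hypothetical sketch of that idea, not the authors' implementation: TinyLM, chunk_boundaries_by_loss, and chunkwise_training_step are illustrative names I introduce, the model is a toy per-token predictor, and the paper's heuristic search is replaced by a simple highest-loss-token split.

```python
# Hypothetical sketch of chunk-wise training (CWT); assumptions are noted inline.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM = 100, 32

class TinyLM(nn.Module):
    """Toy stand-in for the small language model (SLM)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):                 # ids: (T,) token ids
        return self.head(self.embed(ids))   # logits: (T, VOCAB)

def chunk_boundaries_by_loss(model, ids, n_chunks=3):
    """Split the rationale into chunks. Here boundaries are placed before the
    highest-loss tokens, a crude proxy for the paper's SLM-loss-based search."""
    with torch.no_grad():
        per_tok = F.cross_entropy(model(ids[:-1]), ids[1:], reduction="none")
    n_tgt = per_tok.numel()                 # number of predictable target positions
    cuts = sorted(set(per_tok.topk(n_chunks - 1).indices.tolist()) - {0})
    return [0] + cuts + [n_tgt]

def chunkwise_training_step(model, optim, ids, bounds, step):
    """One CWT iteration: back-propagate the loss of a single chunk, so the few
    core reasoning tokens inside it carry more gradient weight than they would
    if the loss were averaged over the whole rationale."""
    i = step % (len(bounds) - 1)            # cycle over chunks
    lo, hi = bounds[i], bounds[i + 1]       # target positions in ids[1:]
    logits = model(ids[lo:hi])              # predict ids[lo+1 : hi+1] from ids[lo : hi]
    loss = F.cross_entropy(logits, ids[lo + 1:hi + 1])
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

model = TinyLM()
optim = torch.optim.SGD(model.parameters(), lr=0.1)
rationale = torch.randint(0, VOCAB, (48,))  # dummy question + rationale + answer tokens
bounds = chunk_boundaries_by_loss(model, rationale)
for step in range(6):
    loss = chunkwise_training_step(model, optim, rationale, bounds, step)
    print(f"step {step}: chunk {step % (len(bounds) - 1)} loss = {loss:.3f}")
```

The point of the per-chunk mean loss in this sketch is that a core reasoning token's share of the gradient rises from roughly 1/|rationale| to 1/|chunk|; skip-thinking training (STT) would additionally train the SLM to jump from an early chunk to the answer chunk, which is not shown here.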