Skip-Thinking: Chunk-wise Chain-of-Thought Distillation Enables Smaller Language Models to Reason Better and Faster
Abstract: Chain-of-thought (CoT) distillation allows a large language model (LLM) to guide a small language model (SLM) in reasoning tasks. Existing methods train the SLM to learn the long rationale in one iteration, which causes two issues: 1) Long rationales lead to a large token-level batch size during training, which over-smooths the gradients of core reasoning tokens (i.e., tokens that directly affect the correctness of subsequent reasoning), since they make up only a tiny fraction of the rationale. As a result, the SLM converges to sharp minima where it fails to grasp the reasoning logic. 2) The answer is slow to produce, as the SLM must generate a long rationale before reaching the answer. Therefore, we propose chunk-wise training (CWT), which uses a heuristic search based on the SLM loss to divide the rationale into semantically coherent chunks and focuses the SLM on learning one chunk per iteration. Because each chunk contains fewer tokens, the gradients of the core reasoning tokens within it receive greater weight during backpropagation. Building on CWT, we propose skip-thinking training (STT), which lets the SLM skip several intermediate reasoning chunks to reach the answer, improving reasoning speed while maintaining accuracy. We validate our approach on a variety of SLMs and multiple reasoning tasks.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Large Language Model, Chain of Thought, Knowledge Distillation
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 3278
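The abstract above describes how chunk-wise training (CWT) splits a rationale and back-propagates the loss of one chunk per iteration. Below is a minimal, hypothetical sketch of that idea, not the authors' implementation: TinyLM, chunk_boundaries_by_loss, and chunkwise_training_step are illustrative names I introduce, the model is a toy per-token predictor, and the paper's heuristic search is replaced by a simple highest-loss-token split.

```python
# Hypothetical sketch of chunk-wise training (CWT); assumptions are noted inline.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM = 100, 32

class TinyLM(nn.Module):
    """Toy stand-in for the small language model (SLM)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):                 # ids: (T,) token ids
        return self.head(self.embed(ids))   # logits: (T, VOCAB)

def chunk_boundaries_by_loss(model, ids, n_chunks=3):
    """Split the rationale into chunks. Here boundaries are placed before the
    highest-loss tokens, a crude proxy for the paper's SLM-loss-based search."""
    with torch.no_grad():
        per_tok = F.cross_entropy(model(ids[:-1]), ids[1:], reduction="none")
    n_tgt = per_tok.numel()                 # number of predictable target positions
    cuts = sorted(set(per_tok.topk(n_chunks - 1).indices.tolist()) - {0})
    return [0] + cuts + [n_tgt]

def chunkwise_training_step(model, optim, ids, bounds, step):
    """One CWT iteration: back-propagate the loss of a single chunk, so the few
    core reasoning tokens inside it carry more gradient weight than they would
    if the loss were averaged over the whole rationale."""
    i = step % (len(bounds) - 1)            # cycle over chunks
    lo, hi = bounds[i], bounds[i + 1]       # target positions in ids[1:]
    logits = model(ids[lo:hi])              # predict ids[lo+1 : hi+1] from ids[lo : hi]
    loss = F.cross_entropy(logits, ids[lo + 1:hi + 1])
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

model = TinyLM()
optim = torch.optim.SGD(model.parameters(), lr=0.1)
rationale = torch.randint(0, VOCAB, (48,))  # dummy question + rationale + answer tokens
bounds = chunk_boundaries_by_loss(model, rationale)
for step in range(6):
    loss = chunkwise_training_step(model, optim, rationale, bounds, step)
    print(f"step {step}: chunk {step % (len(bounds) - 1)} loss = {loss:.3f}")
```

The point of the per-chunk mean loss in this sketch is that a core reasoning token's share of the gradient rises from roughly 1/|rationale| to 1/|chunk|; skip-thinking training (STT) would additionally train the SLM to jump from an early chunk to the answer chunk, which is not shown here.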