Cascaded Chain-of-Thoughts Distillation: Distilling Reasoning Capabilities from Large Language Models
Abstract: Large language models (LLMs) have shown remarkable reasoning capabilities at increased scales, spurring efforts to distill such capabilities into smaller, compact models via teacher-student learning. Previous works either directly fine-tune student models on teacher-generated Chain-of-Thoughts (CoTs) data or learn such data within a multi-task framework. However, these methods struggle with CoTs generalization due to spurious correlations between questions and answers, as well as inconsistencies in the logic connecting the rationales to the answers. In this paper, we propose \textbf{Cas}caded \textbf{Co}Ts \textbf{D}istillation (CasCoD), a straightforward but effective method to address these issues. Specifically, we decompose the full CoTs distillation into two comprehensive tasks and learn them in a cascaded manner by sharing the input prefix. By separating and cascading the tasks, CasCoD not only enables the student model to concentrate on reasoning without being distracted by the answers but also encourages faithful reasoning in students, thus enhancing the generalizability of CoTs. Extensive experiments and further analysis demonstrate the effectiveness of CasCoD on both in-domain and out-of-domain benchmark reasoning datasets.
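To make the cascaded decomposition concrete, below is a minimal sketch of how a two-task training step with a shared input prefix might look for a HuggingFace-style causal LM student. The prompt templates, the weighting coefficient alpha, and the helper names are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: cascaded two-task CoT distillation step (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder student model
model = AutoModelForCausalLM.from_pretrained("gpt2")


def masked_lm_loss(prefix: str, target: str) -> torch.Tensor:
    """Cross-entropy on the target tokens only; the shared prefix is masked out.

    Note: masking length is approximated by tokenizing the prefix separately,
    which is a common simplification for a sketch like this.
    """
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    full_ids = tokenizer(prefix + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # ignore loss on the prefix tokens
    return model(input_ids=full_ids, labels=labels).loss


def cascod_step(question: str, rationale: str, answer: str, alpha: float = 0.5):
    # Task 1: question -> rationale (reasoning learned without the answer in view).
    loss_rationale = masked_lm_loss(
        f"Question: {question}\nRationale:", f" {rationale}"
    )
    # Task 2: question + rationale -> answer (same input prefix, cascaded).
    loss_answer = masked_lm_loss(
        f"Question: {question}\nRationale: {rationale}\nAnswer:", f" {answer}"
    )
    return alpha * loss_rationale + (1 - alpha) * loss_answer
```

In this sketch, the two losses are simply combined with a scalar weight; how the tasks are actually weighted and scheduled follows the paper, which this snippet does not reproduce.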
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English