Distill Models by Aptitude: Efficient Reasoning Capability Distillation via Adaptive Data Curation and Overthinking Mitigation

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: distillation, data-efficient training
TL;DR: This paper introduces DynaGuide, a novel framework that improves both the efficiency and the performance of distillation.
Abstract: The rapidly increasing computational demands of large language models (LLMs) motivate the distillation of their knowledge and capabilities into smaller models. However, existing attempts to distill LLMs' reasoning capabilities into compact models face critical limitations, including high training and annotation costs, suboptimal data selection, and flawed synthetic data caused by LLMs' general tendency to overthink. This paper introduces DynaGuide, a novel framework that improves both the efficiency and the performance of the distillation process. Our approach integrates (1) Dynamic Data Selection, which adaptively selects high-value training data at a fine granularity throughout training, and (2) Reasoning Pattern Guidance, which mitigates overthinking in synthetic data by incorporating specialized guidance during fine-tuning. Extensive experiments demonstrate that DynaGuide achieves consistent, stable performance improvements across models of different series and parameter scales, with gains surpassing those of baseline methods on knowledge-reasoning question-answering benchmarks. Our systematic ablation studies and analyses further provide valuable insights into distillation and reasoning.
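The abstract does not specify the selection criterion, but as a rough, hypothetical illustration of what adaptive, in-training data selection can look like, the sketch below re-scores a candidate pool with the student's current per-example loss at each epoch and keeps only the hardest fraction for that epoch's updates. The names `student_loss`, `dynamic_select`, and `select_fraction` are illustrative assumptions, not DynaGuide's actual mechanism.

```python
import random

def student_loss(example, epoch):
    """Stand-in for the student model's loss on one example.
    A deterministic random stub here; in practice this would be
    a forward pass through the student being fine-tuned."""
    random.seed(hash((example["id"], epoch)))
    return random.random()

def dynamic_select(pool, epoch, select_fraction=0.3):
    """Re-score the whole pool with the current student and keep the
    highest-loss (presumably most informative) fraction for this epoch."""
    scored = sorted(pool, key=lambda ex: student_loss(ex, epoch), reverse=True)
    return scored[: max(1, int(len(scored) * select_fraction))]

pool = [{"id": i, "question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
for epoch in range(3):
    batch = dynamic_select(pool, epoch)
    # train_step(student, batch)  # the fine-tuning update would go here
    print(f"epoch {epoch}: selected {[ex['id'] for ex in batch]}")
```

Because the scores are recomputed with the evolving student, the selected subset shifts from epoch to epoch, which is the "dynamic" aspect the abstract emphasizes; a static top-k chosen once before training would not adapt this way.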
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 15492