SpectrumKD: Dynamic Dataset Curation for Distribution-Aware Knowledge Distillation of Large Language Models
Keywords: Large Language Models, Knowledge Distillation, Data Curation
Abstract: Knowledge Distillation (KD) is a critical technique for compressing large language models (LLMs) into efficient student models while preserving performance, yet its efficacy remains highly sensitive to training data quality. Current dataset curation approaches focus mainly on instance-level quality and informativeness, neglecting the global distribution characteristics of the entire training dataset. This oversight often results in suboptimal data selection that degrades distillation outcomes. To address this limitation, we propose SpectrumKD, a principled data curation framework that dynamically refines training datasets across epochs by leveraging the global distribution of instance difficulty. SpectrumKD constructs a difficulty spectrum over the training corpus by ranking instances based on student model evaluation, partitioning them into four distinct learning phases: Early Learning, Continuous Learning, Late Learning, and No Learning. A sliding window segmentation strategy then selects epoch-specific subsets by adaptively shifting a fixed-size window across the spectrum from low to high difficulty, ensuring a uniform increase in subset difficulty across training epochs. As a plug-and-play module, SpectrumKD enhances diverse white-box KD methods and model architectures at minor computational cost. Extensive experiments across multiple language model benchmarks demonstrate consistent performance gains in distilled models, with improvements observed across varied KD approaches and model families. Crucially, SpectrumKD achieves these gains without modifying core distillation algorithms, highlighting the pivotal role of dataset distribution features and data compatibility in effective LLM distillation. Our work establishes a data-centric paradigm for KD, providing both insights and tools to advance the efficiency and capability of compressed language models.
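To make the spectrum-construction and sliding-window steps concrete, the following is a minimal Python sketch reconstructed from the abstract alone; the paper's actual implementation is not public here. The names `difficulty_score`, `window_frac`, and the linear window schedule are illustrative assumptions, not the authors' code.

```python
# Sketch of SpectrumKD-style dynamic curation, under assumed details:
# - difficulty is measured by some student-based score (e.g., per-instance loss);
# - the fixed-size window slides linearly from the easy end to the hard end.
from typing import Callable, List, Sequence


def build_difficulty_spectrum(
    dataset: Sequence,
    difficulty_score: Callable[[object], float],
) -> List:
    """Rank instances from easiest to hardest for the current student.

    `difficulty_score` stands in for a student-model evaluation; the
    abstract does not specify the exact metric, so this is a placeholder.
    """
    return sorted(dataset, key=difficulty_score)


def sliding_window_subset(
    spectrum: Sequence,
    epoch: int,
    total_epochs: int,
    window_frac: float = 0.5,
) -> List:
    """Select this epoch's training subset with a fixed-size window.

    The window shifts from the low-difficulty end toward the high-difficulty
    end so that subset difficulty increases uniformly across epochs
    (a linear schedule is assumed here for illustration).
    """
    n = len(spectrum)
    win = int(n * window_frac)
    # Fraction of the slideable range covered: 0 at the first epoch,
    # 1 at the last epoch.
    t = epoch / max(total_epochs - 1, 1)
    start = int((n - win) * t)
    return list(spectrum[start:start + win])


# Usage: re-rank each epoch so the spectrum tracks the evolving student,
# then hand the selected subset to any white-box KD objective unchanged.
# spectrum = build_difficulty_spectrum(train_set, student_loss)
# subset = sliding_window_subset(spectrum, epoch, total_epochs)
```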
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9274