Keywords: Knowledge Distillation, Data Pruning
Abstract: Knowledge distillation (KD) is a widely used framework for transferring knowledge from a teacher model to a student model. Prior studies have mainly attributed teacher quality to accuracy or to the teacher-student capacity gap, but it remains unclear how the optimal teacher changes when distillation data is limited, as in data pruning scenarios. To address this gap, we systematically study teacher selection in low-data KD by varying teacher width, training stage, and output structure under data pruning. Our experiments on CIFAR-100 and ImageNet show that the optimal teacher depends strongly on the amount of available data: smaller, less confident, or early-epoch teachers outperform larger or fully trained teachers in low-data regimes. We further show that the teachers that are effective in these regimes share similar output distributions, particularly in their non-target class predictions. Finally, we show that modifying non-target logits can improve KD performance without retraining the teacher.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 156