Could Student Selection Be the Missing Piece for Efficient Distillation?

ICLR 2026 Conference Submission 22572 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Deep learning, Transferability estimation, Knowledge distillation, Computer vision
TL;DR: Unsupervised student selection for a fixed teacher, based on predicting post-distillation performance.
Abstract: Selecting the optimal student architecture remains an overlooked challenge in knowledge distillation (KD). Current approaches typically rely on model size constraints or random selection, ignoring how student architecture and inductive biases impact distillation effectiveness. We formulate this as an unsupervised model selection problem, where the goal is to select the best student for a given teacher without requiring ground-truth labels or expensive training cycles. We propose a transferability metric based on the Neural Tangent Kernel (NTK) that quantifies function-space alignment between teacher and student models. Specifically, our cross-model NTK measures the directional similarity between teacher and student gradient vectors on unlabeled data, capturing how effectively the student can mimic the teacher's function through gradient-based optimization. Unlike existing transferability metrics that require ground-truth labels and focus on model-dataset relationships, our approach directly models the model-model relationship central to KD. To ensure practical applicability with modern networks, we implement an efficient approximation using Johnson-Lindenstrauss random projections that preserves gradient inner products without computing full NTK matrices. Experiments demonstrate that our metric is robust and reliably predicts post-distillation performance, outperforming existing transferability scores adapted for KD and baseline selection strategies, even in low-data scenarios. Our approach enables efficient identification of compatible student architectures before training, eliminating the need for resource-intensive trial-and-error in model compression pipelines.
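
A minimal sketch of one plausible instantiation of such a score, assuming the cross-model comparison reduces to a CKA-style alignment between Johnson-Lindenstrauss-approximated NTK Gram matrices of teacher and student on unlabeled data. The helper names (`ntk_gram_jl`, `centered_kernel_alignment`), the summed-output gradient surrogate, and the toy models in the usage example are illustrative assumptions, not the authors' implementation.

```python
import torch


def ntk_gram_jl(model, inputs, k=1024, seed=0):
    """Approximate the NTK Gram matrix K[i, j] = <g_i, g_j> on `inputs` using
    per-example gradients compressed by a Gaussian Johnson-Lindenstrauss map,
    so the full gradient vectors never have to be stored together."""
    params = [p for p in model.parameters() if p.requires_grad]
    dim = sum(p.numel() for p in params)
    gen = torch.Generator().manual_seed(seed)
    proj = torch.randn(k, dim, generator=gen) / k ** 0.5  # fixed JL projection
    sketches = []
    for x in inputs:
        out = model(x.unsqueeze(0)).sum()   # scalar surrogate of the network function
        grads = torch.autograd.grad(out, params)
        g = torch.cat([gr.reshape(-1) for gr in grads])
        sketches.append(proj @ g)           # k-dim sketch preserving inner products
    Z = torch.stack(sketches)               # (n, k)
    return Z @ Z.T                          # approximate n x n NTK Gram matrix


def centered_kernel_alignment(K1, K2):
    """Scale-invariant CKA-style alignment between two Gram matrices."""
    n = K1.shape[0]
    H = torch.eye(n) - torch.ones(n, n) / n
    K1c, K2c = H @ K1 @ H, H @ K2 @ H
    return (K1c * K2c).sum() / (K1c.norm() * K2c.norm())


# Hypothetical usage: score candidate students against one teacher on a small
# unlabeled batch and keep the highest-scoring architecture for distillation.
if __name__ == "__main__":
    teacher = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
    students = {
        "narrow": torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(), torch.nn.Linear(16, 10)),
        "wide": torch.nn.Sequential(torch.nn.Linear(32, 48), torch.nn.ReLU(), torch.nn.Linear(48, 10)),
    }
    x = torch.randn(64, 32)                 # unlabeled data: no targets needed
    K_teacher = ntk_gram_jl(teacher, x)
    scores = {name: centered_kernel_alignment(K_teacher, ntk_gram_jl(s, x)).item()
              for name, s in students.items()}
    print(scores)
```

The random projection keeps the cost linear in the number of parameters per example while still approximating the gradient inner products that define the NTK, which is what makes this kind of pre-training comparison tractable for modern networks.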
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22572