Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation

Published: 23 Sept 2025, Last Modified: 29 Oct 2025 · NeurReps 2025 Proceedings · CC BY 4.0
Keywords: Knowledge Distillation, Representation Geometry, Intrinsic Dimension, Tunnel Effect
TL;DR: We leverage findings on the intrinsic dimension of representations to develop a novel state-of-the-art distillation technique that does not rely on logit-based losses.
Abstract: Knowledge distillation (KD) methods can transfer the knowledge of a parameter-heavy teacher model to a lightweight student model. The status quo for feature KD methods is to use loss functions based on both logits (i.e., pre-softmax class scores) and intermediate-layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework that trains the student's backbone using feature-based losses \emph{exclusively} (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a \emph{knowledge quality metric} for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate that our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to $15\%$ over standard approaches. We publicly share our code to facilitate future work (\texttt{anonymous.github.com}).
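Illustrative sketch (not the authors' released code): the abstract does not spell out the loss or the layer-selection metric, so the snippet below only shows one plausible reading in PyTorch. It pairs a TwoNN-style intrinsic-dimension estimate (one possible way to score candidate teacher layers, given the paper's emphasis on representation intrinsic dimension and the tunnel effect) with a purely feature-based distillation loss, i.e., an MSE between linearly projected student features and detached teacher features, with no cross-entropy or other logit term. The estimator choice, the linear projection, and all function and class names are assumptions made for illustration.

    # Hypothetical sketch of a feature-only KD setup; names and choices are
    # assumptions, not the paper's actual implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def twonn_intrinsic_dimension(feats: torch.Tensor) -> float:
        """Estimate the intrinsic dimension of a batch of features (N, D)
        with the TwoNN maximum-likelihood estimator (Facco et al., 2017)."""
        x = feats.flatten(1).float()
        dists = torch.cdist(x, x)                   # (N, N) pairwise distances
        dists.fill_diagonal_(float("inf"))          # ignore self-distances
        r, _ = dists.topk(2, dim=1, largest=False)  # two nearest neighbours
        mu = (r[:, 1] / r[:, 0].clamp_min(1e-12)).clamp_min(1.0 + 1e-12)
        return (x.shape[0] / torch.log(mu).sum()).item()

    def select_teacher_layer(layer_feats: dict[str, torch.Tensor]) -> str:
        """Simplistic stand-in for a knowledge quality metric: rank candidate
        teacher layers by estimated intrinsic dimension and pick the highest,
        i.e., avoid layers where the representation has already collapsed."""
        scores = {name: twonn_intrinsic_dimension(f) for name, f in layer_feats.items()}
        return max(scores, key=scores.get)

    class FeatureOnlyKDLoss(nn.Module):
        """Align a chosen student layer to a chosen teacher layer via MSE after
        a learned linear projection -- no logit-based terms involved."""
        def __init__(self, student_dim: int, teacher_dim: int):
            super().__init__()
            self.proj = nn.Linear(student_dim, teacher_dim)

        def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
            return F.mse_loss(self.proj(f_student.flatten(1)),
                              f_teacher.flatten(1).detach())

How the intrinsic-dimension scores map onto the paper's actual knowledge quality metric, and how the distilled student is read out for classification without a logit loss, are left open by the abstract; the above is only a sketch under those stated assumptions.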
Submission Number: 14