Abstract: Traditional Knowledge Distillation (KD) transfers all outputs from a teacher model to a student model, often introducing knowledge redundancy. This redundancy dilutes critical information and degrades student performance. To address this, we propose Salient Feature Masking for Knowledge Distillation (SFKD), a lightweight enhancement that masks out less informative components and selectively distills only the top-K activations. SFKD is a drop-in modification applicable to both logit-based and feature-based KD, incurs negligible overhead, and sharpens the student’s learning signal. Empirically, SFKD yields consistent gains across architectures (ConvNeXt, ViT) and dataset scales (CIFAR-100: +5.44 pp; CUB: +6.39 pp; ImageNet-1K: +3.57 pp). We also provide intuition from the Information Bottleneck perspective to motivate why filtering out less salient teacher signals benefits the student. Overall, SFKD is a simple, empirically validated method for training student models that are both leaner and more accurate.
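To make the top-K masking idea concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how a logit-based variant could work in PyTorch: only the K largest teacher activations per sample are kept before forming the softened distillation targets. The function name, the choice of K, and the temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def topk_masked_kd_loss(student_logits, teacher_logits, k=10, temperature=4.0):
    """Illustrative top-K masked KD loss (sketch, not the paper's code)."""
    # Indices of the K most salient teacher activations for each sample.
    _, topk_idx = teacher_logits.topk(k, dim=-1)
    mask = torch.zeros_like(teacher_logits, dtype=torch.bool)
    mask.scatter_(-1, topk_idx, True)

    # Suppress the non-salient positions before computing softened targets.
    neg_inf = torch.finfo(teacher_logits.dtype).min
    masked_teacher = teacher_logits.masked_fill(~mask, neg_inf)

    teacher_probs = F.softmax(masked_teacher / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # Standard KD objective (KL divergence), scaled by T^2.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2


# Usage example: a batch of 8 samples over 100 classes.
student_out = torch.randn(8, 100)
teacher_out = torch.randn(8, 100)
loss = topk_masked_kd_loss(student_out, teacher_out, k=10)
```

A feature-based variant would apply the same top-K selection to intermediate activations instead of logits, keeping the rest of the distillation pipeline unchanged.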
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We repositioned the paper as an empirical study of SFKD, removing Proposition 1 and using IB solely as motivation.
Assigned Action Editor: ~Martha_White1
Submission Number: 5285