Abstract: Traditional Knowledge Distillation (KD) transfers all outputs from a teacher model to a student model, often introducing knowledge redundancy. This redundancy dilutes critical information and degrades student performance. To address this, we propose Salient Feature Masking for Knowledge Distillation (SFKD), a lightweight enhancement that masks out less informative components and selectively distills only the top-K activations. SFKD is a drop-in modification applicable to both logit-based and feature-based KD, incurs negligible overhead, and sharpens the student’s learning signal. Empirically, SFKD yields consistent gains across architectures (ConvNeXt, ViT) and dataset scales (CIFAR-100: +5.44 pp; CUB: +6.39 pp; ImageNet-1K: +3.57 pp). We also provide intuition from the Information Bottleneck perspective to motivate why filtering out less salient teacher signals benefits the student. Overall, SFKD is a simple, empirically validated method for training student models that are both leaner and more accurate.
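To make the top-K masking idea concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how a logit-based variant could work in PyTorch: only the K largest teacher activations per sample are kept before forming the softened distillation targets. The function name, the choice of K, and the temperature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def topk_masked_kd_loss(student_logits, teacher_logits, k=10, temperature=4.0):
    """Illustrative top-K masked KD loss (sketch, not the paper's code)."""
    # Indices of the K most salient teacher activations for each sample.
    _, topk_idx = teacher_logits.topk(k, dim=-1)
    mask = torch.zeros_like(teacher_logits, dtype=torch.bool)
    mask.scatter_(-1, topk_idx, True)

    # Suppress the non-salient positions before computing softened targets.
    neg_inf = torch.finfo(teacher_logits.dtype).min
    masked_teacher = teacher_logits.masked_fill(~mask, neg_inf)

    teacher_probs = F.softmax(masked_teacher / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # Standard KD objective (KL divergence), scaled by T^2.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2


# Usage example: a batch of 8 samples over 100 classes.
student_out = torch.randn(8, 100)
teacher_out = torch.randn(8, 100)
loss = topk_masked_kd_loss(student_out, teacher_out, k=10)
```

A feature-based variant would apply the same top-K selection to intermediate activations instead of logits, keeping the rest of the distillation pipeline unchanged.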
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We repositioned the paper as an empirical study of SFKD, removing Proposition 1 and using IB solely as motivation.
Assigned Action Editor: ~Martha_White1
Submission Number: 5285