Keywords: Knowledge distillation, robustness, distantly-supervised named entity recognition, noisy label learning, transfer gap
Abstract: \textit{Knowledge Distillation} (\textit{KD}) has become a cornerstone for model compression, semi-supervised learning, and self-training.
Despite its success, the standard KL-based objective suffers from a structural flaw: it \emph{couples} supervision on target and non-target classes. This coupling links the estimation of target probability mass to the loss on non-target probabilities, thereby amplifying mass mismatch and destabilizing optimization under noise or teacher miscalibration. To address this issue, we propose \emph{Target-Aware Normalized Distillation} (\textit{TAND}), a principled framework that explicitly decouples and normalizes distillation signals. TAND combines \emph{Normalized KD} (\textit{NKD}), which aligns the normalized non-target distributions of student and teacher, with \emph{Target-Aware Distillation} (\textit{TAD}), which assigns independent weights to target and non-target terms. This explicit decoupling breaks the hidden dependency in KD, stabilizes gradient dynamics, and offers direct control over supervision strength. We theoretically prove that TAND reduces gradient variance, explaining its robustness, and empirically validate its effectiveness on distantly supervised NER, noisy label learning, and transfer gap tasks. Across all settings, TAND consistently outperforms KL-based KD baselines, demonstrating strong robustness to noise across different noise levels and model architectures.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 12484