Enhancing Logits Distillation with Plug&Play Kendall's $\tau$ Ranking Loss

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Plug & Play Ranking Loss for Logits Distillation Abstract:
Abstract: Knowledge distillation typically minimizes the Kullback–Leibler (KL) divergence between teacher and student logits. However, optimizing the KL divergence can be challenging for the student and often leads to sub-optimal solutions. We further show that gradients induced by KL divergence scale with the magnitude of the teacher logits, thereby diminishing updates on low-probability channels. This imbalance weakens the transfer of inter-class information and in turn limits the performance improvements achievable by the student. To mitigate this issue, we propose a plug-and-play auxiliary ranking loss based on Kendall’s $\tau$ coefficient that can be seamlessly integrated into any logit-based distillation framework. It supplies inter-class relational information while rebalancing gradients toward low-probability channels. We demonstrate that the proposed ranking loss is largely invariant to channel scaling and optimizes an objective aligned with that of KL divergence, making it a natural complement rather than a replacement. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that our plug-and-play ranking loss consistently boosts the performance of multiple distillation baselines.
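The following is a minimal, hypothetical sketch of the idea described in the abstract: a differentiable surrogate of Kendall's $\tau$ over teacher and student logits, added as an auxiliary term next to the standard KL distillation loss. The tanh relaxation, function names, and weighting are illustrative assumptions rather than the authors' exact formulation.

```python
# Hypothetical sketch: KL distillation plus a differentiable Kendall's tau
# ranking term. Not the paper's exact loss; relaxation and weights are assumed.
import torch
import torch.nn.functional as F


def soft_kendall_tau(student_logits, teacher_logits, temperature=1.0):
    """Differentiable Kendall's tau surrogate between student and teacher logits.

    For each class pair (i, j), concordance is approximated by
    tanh(s_i - s_j) * tanh(t_i - t_j): roughly +1 when the pair is ranked the
    same way by both models, -1 when the ranking disagrees.
    """
    # Pairwise logit differences, shape (batch, C, C) via broadcasting.
    s_diff = student_logits.unsqueeze(2) - student_logits.unsqueeze(1)
    t_diff = teacher_logits.unsqueeze(2) - teacher_logits.unsqueeze(1)
    concordance = torch.tanh(s_diff / temperature) * torch.tanh(t_diff / temperature)
    c = student_logits.size(1)
    # Diagonal entries are zero, so the full sum counts each unordered pair twice.
    return concordance.sum(dim=(1, 2)) / (c * (c - 1))  # per-sample value in [-1, 1]


def distillation_loss(student_logits, teacher_logits, kd_T=4.0, rank_weight=1.0):
    """Standard temperature-scaled KL distillation plus the auxiliary ranking term."""
    kl = F.kl_div(
        F.log_softmax(student_logits / kd_T, dim=1),
        F.softmax(teacher_logits / kd_T, dim=1),
        reduction="batchmean",
    ) * (kd_T ** 2)
    rank = (1.0 - soft_kendall_tau(student_logits, teacher_logits)).mean()
    return kl + rank_weight * rank
```

Because the ranking term depends only on the ordering of logit differences (squashed through tanh), it is largely insensitive to the magnitude of individual channels, which is how such a loss could rebalance gradients toward low-probability channels as described above.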
Lay Summary: Knowledge distillation transfers capabilities from a powerful teacher model to a lightweight student model. However, existing distillation losses overlook low-probability channels and suffer from suboptimal optimization, limiting the transfer of inter-class relational knowledge and hindering performance gains. To mitigate this issue, we propose a plug-and-play auxiliary ranking loss based on Kendall’s $\tau$ coefficient. It supplies low-probability channel information and aligns optimization objectives, seamlessly integrating with most distillation frameworks. Extensive experiments across multiple datasets and various teacher-student architecture combinations demonstrate that our plug-and-play ranking loss consistently boosts the performance of multiple distillation baselines.
Primary Area: General Machine Learning->Supervised Learning
Keywords: Knowledge Distillation, Kendall's tau Coefficient, Ranking Loss
Submission Number: 5974