Comparative Knowledge Distillation

Published: 01 Jan 2025 · Last Modified: 16 May 2025 · WACV 2025 · CC BY-SA 4.0
Abstract: In the era of large-scale pretrained models, Knowledge Distillation (KD) serves an important role in transferring the wisdom of computationally heavy teacher models to lightweight, efficient student models while preserving performance. Yet KD settings often assume readily available access to teacher models capable of performing many inferences, a notion increasingly at odds with the realities of costly large-scale models. Addressing this gap, we study an important question: how do KD algorithms fare as the number of teacher inferences decreases, a setting we term Reduced-Teacher-Inference Knowledge Distillation (RTI-KD)? We observe that the performance of prevalent KD techniques and state-of-the-art data augmentation strategies suffers considerably as the number of teacher inferences is reduced. One class of approaches, termed "relational" knowledge distillation, underperforms the rest, yet we hypothesize that it holds promise for reduced dependency on teacher models because it can augment the effective dataset size without additional teacher calls. We find that a simple change, performing high-dimensional comparisons instead of low-dimensional relations (which we term Comparative Knowledge Distillation), vaults performance well above existing KD approaches. We perform empirical evaluation across varied experimental settings and rigorous analysis to understand the learning outcomes of our method. All code is made publicly available.
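The abstract does not spell out the exact loss, but the idea of matching high-dimensional comparisons (rather than scalar relational statistics) between cached teacher features and student features can be illustrated with the following minimal PyTorch sketch. It is one plausible reading under stated assumptions, not the paper's implementation; the class name `ComparativeKDLoss`, the linear projection head, and the MSE objective are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ComparativeKDLoss(nn.Module):
    """Hypothetical sketch of a comparative KD objective.

    Relational KD typically compares low-dimensional statistics of sample
    pairs (e.g., pairwise distances). Here, the full high-dimensional
    *difference vectors* between pairs of teacher features and pairs of
    student features are aligned instead. Because teacher features are
    cached once per sample, new pairs enlarge the effective training
    signal without any additional teacher inference.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projection so student differences live in the teacher's feature
        # space (an assumption; the paper's exact head may differ).
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats: torch.Tensor,
                teacher_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (B, d_s); teacher_feats: (B, d_t), precomputed once.
        batch_size = student_feats.size(0)
        idx_i, idx_j = torch.triu_indices(batch_size, batch_size, offset=1)

        # High-dimensional comparison vectors for every unordered pair.
        s_diff = self.proj(student_feats[idx_i] - student_feats[idx_j])
        t_diff = teacher_feats[idx_i] - teacher_feats[idx_j]

        # Match the comparison vectors directly (assumed MSE objective).
        return F.mse_loss(s_diff, t_diff)


if __name__ == "__main__":
    loss_fn = ComparativeKDLoss(student_dim=128, teacher_dim=512)
    s = torch.randn(16, 128)   # student features for one batch
    t = torch.randn(16, 512)   # cached teacher features for the same batch
    print(loss_fn(s, t).item())
```

In this reading, the contrast with relational KD is that the pairwise difference is kept as a full vector rather than collapsed to a scalar distance or angle, which is the "high-dimensional comparison" the abstract refers to.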