Asymmetric Decision-Making in Online Knowledge Distillation: Unifying Consensus and Divergence

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Online Knowledge Distillation (OKD) methods represent a streamlined, one-stage distillation training process that obviates the need for a pretrained teacher network when transferring knowledge to a more compact student network. In contrast to existing logits-based OKD methods, this paper presents an innovative approach that leverages intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two pivotal insights: (1) the features shared between student and teacher models are predominantly focused on foreground objects; (2) teacher models emphasize foreground objects more than student models do. Building on these findings, we propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity to student models, indicating superior performance by teacher models in these regions. Consequently, ADM helps the student models catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when transferred to offline knowledge distillation, semantic segmentation, and diffusion distillation tasks.
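To make the two asymmetric objectives concrete, the sketch below shows one possible instantiation under stated assumptions; it is not the paper's exact formulation. The function name `adm_losses`, the `top_ratio` selection parameter, the masked-MSE consensus term for the student, and the similarity-suppressing divergence term for the teacher are all illustrative assumptions introduced here, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def adm_losses(f_s, f_t, top_ratio=0.5):
    """Hypothetical sketch of ADM-style asymmetric feature distillation.

    f_s, f_t: student / teacher feature maps of shape (B, C, H, W).
    top_ratio: fraction of spatial locations selected for each branch
               (an illustrative hyper-parameter, not from the paper).
    """
    # Per-location cosine similarity between student and teacher features.
    sim = F.cosine_similarity(f_s.detach(), f_t.detach(), dim=1).flatten(1)  # (B, H*W)
    k = max(1, int(top_ratio * sim.shape[1]))

    # Consensus Learning (student): imitate the teacher at the most similar,
    # typically foreground, locations.
    consensus_idx = sim.topk(k, dim=1, largest=True).indices                 # (B, k)
    diff_s = (f_s - f_t.detach()).pow(2).mean(dim=1).flatten(1)              # (B, H*W)
    loss_student = diff_s.gather(1, consensus_idx).mean()

    # Divergence Learning (teacher): promote feature diversity at the least
    # similar locations, where the teacher currently outperforms the student.
    divergence_idx = sim.topk(k, dim=1, largest=False).indices               # (B, k)
    sim_t = F.cosine_similarity(f_s.detach(), f_t, dim=1).flatten(1)         # grads reach teacher only
    loss_teacher = sim_t.gather(1, divergence_idx).mean()                    # pushes similarity down

    return loss_student, loss_teacher
```

In a single-stage OKD training loop, these two terms would be added to the respective task losses of the student and teacher networks; the weighting and the actual form of each term follow the paper's definition of ADM.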
Lay Summary: When teaching smaller, faster AI models to learn from larger, more powerful ones—a process known as knowledge distillation—most methods rely on a two-step process: first training a large “teacher” model, and then having a “student” model learn from its outputs. Recently, a new family of methods called Online Knowledge Distillation (OKD) has made this process more efficient by letting both teacher and student models learn together in a single stage. Our research digs deeper into how these models learn from each other, focusing not just on the final predictions, but on the “intermediate” features—how the models internally process images. We found that both models tend to focus on the main objects in an image, but the teacher model pays even more attention to these important areas than the student does. Building on this, we introduce a new approach called Asymmetric Decision-Making (ADM). ADM helps the student model better match the teacher where it matters most, while encouraging the teacher to keep exploring new patterns. This leads to smarter, more effective student models. Our experiments show that ADM improves performance across a range of tasks, making knowledge distillation faster and more powerful.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Computer Vision
Keywords: knowledge distillation
Submission Number: 331