Teaching Wiser, Learning Smarter: Multi-stage Decoupled Relational Knowledge Distillation with Adaptive Stage Selection

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: relation-based knowledge distillation, multi-stage, decouple, contrastive learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Due to the effectiveness of contrastive-learning-based knowledge distillation methods, there has been renewed interest in relational knowledge distillation. However, these methods primarily rely on transferring angle-wise information between samples, using only the normalized output of the penultimate layer as the source of knowledge. Our experiments demonstrate that properly harnessing relational information derived from intermediate layers can further improve the effectiveness of distillation. Meanwhile, we find that simply adding distance-wise relational information to contrastive-learning-based methods degrades distillation quality, revealing an implicit conflict between angle-wise and distance-wise attributes. We therefore propose a ${\bf{M}}$ulti-stage ${\bf{D}}$ecoupled ${\bf{R}}$elational (MDR) knowledge distillation framework equipped with an adaptive stage selection that identifies the stages maximizing the efficacy of transferring relational knowledge. Furthermore, our framework decouples angle-wise and distance-wise information to resolve their conflict while still preserving the complete relational knowledge, thereby improving transfer efficiency and distillation quality. To evaluate the proposed method, we conduct extensive experiments on multiple image benchmarks ($\textit{i.e.}$, CIFAR-100, ImageNet and Pascal VOC) covering various tasks ($\textit{i.e.}$, classification, few-shot learning, transfer learning and object detection). Our method exhibits superior performance under diverse scenarios, surpassing the state of the art by an average of 1.08\% on CIFAR-100 across widely used teacher-student network pairs.
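For readers unfamiliar with the distance-wise and angle-wise relations referenced in the abstract, the sketch below shows the two relational loss terms kept as separate, independently weighted objectives. It is a minimal sketch assuming the standard RKD-style relation definitions (Park et al., 2019); the function names, loss weights, and the PyTorch framing are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: decoupled distance-wise and angle-wise relational losses,
# assuming RKD-style relation definitions. Names and weights are illustrative.
import torch
import torch.nn.functional as F


def distance_relation(feats):
    # Pairwise Euclidean distances, normalized by their mean (distance-wise relation).
    d = torch.cdist(feats, feats, p=2)
    off_diag = ~torch.eye(d.size(0), dtype=torch.bool, device=d.device)
    return d / d[off_diag].mean().clamp(min=1e-8)


def angle_relation(feats):
    # Cosine of the angle at the middle sample for every (i, j, k) triplet
    # (angle-wise relation). Memory grows as N^3, which is fine for a sketch.
    diff = feats.unsqueeze(0) - feats.unsqueeze(1)      # (N, N, D) pairwise difference vectors
    diff = F.normalize(diff, p=2, dim=-1)
    return torch.einsum('ijd,kjd->ijk', diff, diff)     # (N, N, N) cosine values


def decoupled_relational_loss(student_feats, teacher_feats, w_dist=1.0, w_angle=2.0):
    # Distance-wise and angle-wise terms are computed and weighted separately,
    # so the two kinds of relational knowledge can be balanced independently.
    loss_dist = F.smooth_l1_loss(distance_relation(student_feats),
                                 distance_relation(teacher_feats))
    loss_angle = F.smooth_l1_loss(angle_relation(student_feats),
                                  angle_relation(teacher_feats))
    return w_dist * loss_dist + w_angle * loss_angle
```

Keeping the two terms as separate losses with independent weights is one plausible reading of the "decoupling" described in the abstract; the stage selection and exact weighting used in MDR are described in the paper itself.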
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2985