Meta-Collaboration in Distillation: Pooled Learning from Multiple Students

22 Sept 2023 (modified: 11 Feb 2024) Submitted to ICLR 2024
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Knowledge Distillation, Re-weighting, Meta-Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Knowledge distillation (KD) approximates a large teacher model with a smaller student model. KD can be used to train multiple students of different capacities, allowing inference cost to be managed flexibly at test time. We propose a novel distillation method, termed meta-collaboration, in which a set of students is simultaneously distilled from a single teacher and the students improve each other through information sharing during distillation. We model this information sharing with an auxiliary network that predicts instance-specific loss-mixing weights for each student. This auxiliary network is trained jointly with the multi-student distillation, using a meta-loss that aggregates the students' losses on a held-out validation set. Our method improves accuracy for all students and outperforms state-of-the-art distillation baselines, including methods that use multi-step distillation combining models of different sizes. In particular, adding smaller students to the pool clearly benefits larger student models through the mechanism of meta-collaboration. We show average gains of 2.5\% on CIFAR100 and 2\% on TinyImageNet; our gains are consistent across a wide range of student sizes, teacher sizes, and model architectures.
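The instance-specific loss-mixing idea described in the abstract can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' implementation: the LossMixer module, its input statistics, and the three-term loss decomposition (cross-entropy, teacher KD, peer KD) are assumptions made for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical auxiliary network: given per-instance statistics (e.g. each
# student's current loss plus a teacher-confidence signal), it predicts
# instance-specific mixing weights over the available loss terms for one student.
class LossMixer(nn.Module):
    def __init__(self, num_students, num_terms=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_students + 1, hidden),  # per-instance student losses + teacher entropy
            nn.ReLU(),
            nn.Linear(hidden, num_terms),
        )

    def forward(self, stats):                          # stats: (batch, num_students + 1)
        return torch.softmax(self.net(stats), dim=-1)  # (batch, num_terms)


def mixed_student_loss(student_logits, teacher_logits, peer_logits, labels, weights, T=4.0):
    # weights: (batch, 3) from the meta-network; columns are assumed to weight
    # [cross-entropy, teacher distillation, peer distillation] respectively.
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    kd_teacher = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="none",
    ).sum(-1) * T * T
    kd_peer = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(peer_logits.detach() / T, dim=-1),
        reduction="none",
    ).sum(-1) * T * T
    per_instance = weights[:, 0] * ce + weights[:, 1] * kd_teacher + weights[:, 2] * kd_peer
    return per_instance.mean()

In the full method as described, the auxiliary network itself would be updated by a meta-loss aggregating the students' losses on a held-out validation set (a bi-level, meta-learning-style update), which this sketch omits.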
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5456