Keywords: Knowledge distillation, Pretrained models, Mutual information, Sharpness Aware Minimization, Mixture-of-Experts
Abstract: Transferring the world knowledge encoded in pretrained models through knowledge distillation is an effective way to improve the performance of small, task-specific production models. However, the effectiveness of such knowledge transfer drops sharply for strong models pretrained at large scale. In this paper, we explore methods that preprocess strong pretrained models to improve the effectiveness of their knowledge transfer. Taking a mutual information perspective on distillation effectiveness, we propose incorporating mutual information-aware optimization into the fine-tuning of strong pretrained models. For small or highly imbalanced downstream datasets, where such optimization is less effective, we further propose heuristically reweighting the MLP blocks, inspired by our observation that the top MLP blocks often cause the loss of mutual information. Our method enables small student models to benefit from even the strongest pretrained models.
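The keywords list Sharpness Aware Minimization, and the abstract describes folding a mutual information-aware objective into teacher fine-tuning. As a rough illustration of how a sharpness-aware update can be incorporated into fine-tuning, below is a minimal PyTorch sketch of the standard two-pass SAM step (Foret et al., 2021). The function name `sam_finetune_step`, the `rho` default, and the plain `loss_fn(model(inputs), targets)` interface are assumptions made for illustration, not the authors' implementation.

```python
import torch

def sam_finetune_step(model, loss_fn, inputs, targets, base_opt, rho=0.05):
    """One sharpness-aware fine-tuning step (two forward/backward passes).
    Illustrative sketch only; not the paper's actual training code."""
    base_opt.zero_grad()

    # First pass: gradients at the current weights.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Perturb weights toward the locally worst-case direction
    # within an L2 ball of radius rho.
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
        perturbations = []
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations.append(e)

    # Second pass: gradients at the perturbed weights drive the actual update.
    base_opt.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Undo the perturbation, then apply the base optimizer step.
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
    return loss.item()
```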
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9453