Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Knowledge Distillation, Interactive Communication, Distill Foundation Model
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Many recent breakthroughs in machine learning have been enabled by pre-trained foundation models. By scaling up model parameters, training data, and computation resources, foundation models have significantly advanced the state of the art in many applications. However, how to use these models to perform downstream tasks efficiently remains an open question. Knowledge distillation (KD) has been explored to tackle this challenge. KD transfers knowledge from a large teacher model to a smaller student model. While KD has been successful in improving student model performance, recent research has discovered that a powerful teacher does not necessarily lead to a powerful student, due to their huge capacity gap. In addition, potential distribution shifts between the pre-training data and downstream tasks can make knowledge transfer in KD sub-optimal for improving downstream task performance. In this paper, we extend the knowledge distillation paradigm by introducing an interactive communication process that helps student models for downstream tasks learn effectively from pre-trained foundation models. Our design is inspired by the way humans learn from teachers who can explain knowledge in a way that meets the students' needs. Specifically, we let each model (i.e., student and teacher) train two components: (1) an encoder that encodes the model's hidden states into a message in a message space shared with other models, and (2) a decoder that decodes any message into its own hidden states. With the encoder and decoder, not only can the teacher transfer rich information by encoding its hidden states into messages, but the student can also send messages carrying downstream-task information to the teacher, which the teacher can interpret and respond to. Through this interactive communication process, the knowledge passed from teacher to student can be tailored to the student's model capacity and the downstream tasks' data distributions. We conduct experiments on benchmark datasets for computer vision and recommendation tasks and show that our communication mechanism outperforms state-of-the-art distillation techniques.
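
The following is a minimal sketch, not taken from the submission, of the encoder/decoder communication idea the abstract describes: each model maps its hidden states into a shared message space and can decode any message back into its own hidden space, so a student "question" can be interpreted by the teacher and answered with a tailored "response". All names (Communicator, MSG_DIM, communication_step), dimensions, the additive conditioning, and the MSE alignment loss are assumptions for illustration only.

```python
import torch
import torch.nn as nn

MSG_DIM = 128  # assumed size of the shared message space

class Communicator(nn.Module):
    """Encoder/decoder pair attached to a model's hidden representation."""
    def __init__(self, hidden_dim: int, msg_dim: int = MSG_DIM):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, msg_dim)   # own hidden states -> shared message
        self.decoder = nn.Linear(msg_dim, hidden_dim)   # any message -> own hidden states

    def encode(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.encoder(hidden)

    def decode(self, message: torch.Tensor) -> torch.Tensor:
        return self.decoder(message)

def communication_step(student_hidden, teacher_hidden, student_comm, teacher_comm):
    """One interactive round: the student asks, the teacher interprets and responds."""
    # Student encodes its task-specific hidden state into a message (a "question").
    query = student_comm.encode(student_hidden)
    # Teacher decodes the query into its own hidden space, conditions its hidden
    # state on it (here by simple addition), and encodes a response message.
    interpreted = teacher_comm.decode(query)
    response = teacher_comm.encode(teacher_hidden + interpreted)
    # Student decodes the response and aligns its hidden state to it.
    target = student_comm.decode(response).detach()
    return nn.functional.mse_loss(student_hidden, target)

# Hypothetical usage with made-up hidden sizes (student 256-d, teacher 1024-d):
student_comm, teacher_comm = Communicator(256), Communicator(1024)
loss = communication_step(torch.randn(8, 256), torch.randn(8, 1024),
                          student_comm, teacher_comm)
```

In this sketch the alignment loss would be added to the student's downstream-task loss; how the actual paper conditions the teacher on the query and which distillation objective it uses are not specified in the abstract.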
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8647