Abstract: The proliferation of pre-trained models empowers knowledge distillation (KD) by providing abundant teacher resources. At the same time, searching this massive model repository for a suitable teacher and then extracting its knowledge become daunting challenges. Standard KD faces two obstacles when training a student given plentiful pre-trained teachers, i.e., the "faculty". First, we need to identify the most helpful teacher in the faculty efficiently rather than enumerating all of them for a given student. Second, since a teacher may be pre-trained on a task different from the student's, its knowledge must be distilled from a more general label space. This paper studies this "faculty distillation" problem, in which a student performs teacher assessment and generalized knowledge reuse. We leverage optimal transport to construct a unifying objective for both problems, which bridges the semantic gap and measures the relatedness between a pair of models. This objective selects the most relevant teacher, and we subsequently minimize the same objective over the student's parameters to transfer knowledge from the selected teacher. Experiments in various settings demonstrate the succinctness and versatility of our proposed method.
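To make the two uses of the objective concrete, below is a minimal, illustrative sketch, not the authors' released code: an entropic optimal-transport cost between teacher and student predictive distributions is used first to rank candidate teachers and then, since it is differentiable, as the distillation loss for the student. The function names (`sinkhorn_distance`, `ot_objective`), the toy ground cost between label spaces, and the dummy model dimensions are assumptions made for the example.

```python
import torch

def sinkhorn_distance(cost, mu, nu, eps=0.1, n_iters=50):
    """Entropic OT cost <plan, cost> between marginals mu (n,) and nu (m,)."""
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                   # Sinkhorn fixed-point updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    plan = u[:, None] * K * v[None, :]         # approximate transport plan
    return (plan * cost).sum()

def ot_objective(teacher_logits, student_logits):
    """OT cost between teacher and student predictions over (possibly
    different) label spaces; lower means the pair is more related."""
    p_t = teacher_logits.softmax(dim=-1).mean(0)   # teacher class marginal
    p_s = student_logits.softmax(dim=-1).mean(0)   # student class marginal
    # Toy ground cost between the two label spaces; in practice this would
    # encode semantic distance between labels (e.g., from label embeddings).
    ct = torch.linspace(0, 1, p_t.numel())
    cs = torch.linspace(0, 1, p_s.numel())
    cost = (ct[:, None] - cs[None, :]) ** 2
    return sinkhorn_distance(cost, p_t, p_s)

# 1) Teacher assessment: rank the faculty by the OT objective on probe data.
# 2) Knowledge reuse: minimize the same objective over student parameters.
x = torch.randn(32, 16)                                        # dummy inputs
student = torch.nn.Linear(16, 10)                              # toy student
faculty = [torch.nn.Linear(16, 100), torch.nn.Linear(16, 20)]  # toy teachers
with torch.no_grad():
    scores = [ot_objective(t(x), student(x)) for t in faculty]
best = faculty[int(torch.stack(scores).argmin())]   # most related teacher
loss = ot_objective(best(x).detach(), student(x))   # same objective as loss
loss.backward()                                     # update student only
```

The key point the sketch tries to convey is that a single objective serves both steps: it is evaluated (without gradients) to choose a teacher from the faculty, and then minimized with respect to the student to transfer that teacher's knowledge across mismatched label spaces.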