Representation Consolidation from Multiple Expert Teachers

Published: 28 Jan 2022, Last Modified: 13 Feb 2023, ICLR 2022 Submission
Keywords: transfer learning, distillation, pretraining, model merging
Abstract: A library of diverse expert models transfers better to a novel task than a single generalist model. However, growing such a library indefinitely is impractical. Hence, we explore the problem of learning a consolidated image feature representation from a collection of related task-specific teachers that transfers well to novel recognition tasks. This differs from traditional knowledge distillation, in which a student model is trained to emulate the input/output functionality of a teacher. Indeed, we observe experimentally that standard distillation of task-specific teachers, or using their representations directly, reduces downstream transferability compared to a task-agnostic generalist model. We show that a simple multi-head, multi-task distillation method that uses an unlabeled proxy dataset and adds a generalist teacher is sufficient to consolidate representations from task-specific teachers. The consolidated representation improves downstream performance, outperforming the best individual teacher as well as the strong baseline of ImageNet pre-trained features. Our method nearly matches the performance of a multi-task joint-training oracle, reaping the benefit of the teachers without replaying their training data.
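
To make the described setup concrete, below is a minimal sketch of multi-head, multi-task feature distillation from several frozen teachers (including a generalist) on an unlabeled proxy dataset. This is not the authors' code: the ResNet-50 backbone, linear projection heads, and MSE feature-matching loss are illustrative assumptions, and all names (`ConsolidatedStudent`, `distillation_step`) are hypothetical.

```python
# Sketch: consolidate several frozen teachers into one student backbone
# via per-teacher projection heads trained on unlabeled proxy images.
import torch
import torch.nn as nn
import torchvision.models as models


class ConsolidatedStudent(nn.Module):
    def __init__(self, num_teachers: int, teacher_dim: int = 2048):
        super().__init__()
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()          # shared student backbone
        self.backbone = backbone
        # one lightweight head per teacher (task-specific teachers + generalist)
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, teacher_dim) for _ in range(num_teachers)
        )

    def forward(self, x):
        feats = self.backbone(x)
        return [head(feats) for head in self.heads]


def distillation_step(student, teachers, images, optimizer):
    """One update on a batch of unlabeled proxy images.

    `teachers` are frozen feature extractors assumed to output
    `teacher_dim`-dimensional vectors; no labels or teacher training
    data are used.
    """
    with torch.no_grad():                    # teachers stay frozen
        targets = [t(images) for t in teachers]
    preds = student(images)
    # sum of per-head feature-matching losses (MSE is one plausible choice)
    loss = sum(nn.functional.mse_loss(p, t) for p, t in zip(preds, targets))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training, the heads can be discarded and only the shared backbone kept as the consolidated representation for downstream transfer.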
One-sentence Summary: We propose representation consolidation, which combines multiple models into one without data replay and maximizes the representational power of the resulting model rather than emulating the old models' input-output functionality.