Keywords: Knowledge Distillation, Vision Foundation Models, Task-Specific Distillation, Low-Rank Adaptation
Abstract: In resource-constrained environments such as embedded systems, adapting reduced-size foundation models to downstream tasks has become increasingly popular. This has motivated the emerging setting of task-specific distillation, in which a larger and a smaller version of the same foundation model are both adapted to the same downstream task, with the goal of transferring knowledge from the former to the latter. Recent work has demonstrated the benefits of using the larger model to assist the adaptation of the smaller one: typically, the larger model (teacher) is first adapted via fine-tuning or linear probing before its knowledge is distilled into the smaller model (student). While fine-tuning the teacher often increases its own performance, recent work has shown that linear probing it leads to better knowledge distillation to the student. Our findings show that this is mainly due to a misalignment in feature representations between the teacher and the student that arises when the teacher is fine-tuned. Inspired by existing efforts to preserve previously learned knowledge, we first propose to leverage low-rank adaptation, which yields better feature alignment and therefore better knowledge transfer. Building on this insight, we further enhance feature alignment through a parameter-sharing strategy in which adapters are shared between the two encoders during joint training. Our proposed method, SLAD, achieves better feature alignment between the teacher and the student, which increases the performance of not only the student but also the teacher, while being $2\times$ faster to train than fine-tuning. Through extensive experiments on multiple classification and segmentation datasets, we demonstrate the improved accuracy and transfer efficiency of our method, which achieves state-of-the-art performance in the task-specific distillation setting.
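To make the adapter-sharing idea concrete, below is a minimal, hypothetical PyTorch sketch of a low-rank adapter whose weights are shared between a teacher and a student encoder during joint training. All names (`SharedLoRAAdapter`, `AdaptedLinear`, the layer sizes) are illustrative assumptions for exposition, not the paper's actual SLAD implementation; the sketch also assumes the two encoders expose features of matching dimension, which real teacher/student pairs may not.

```python
# Hypothetical sketch: low-rank adapters shared between two frozen encoders.
import torch
import torch.nn as nn


class SharedLoRAAdapter(nn.Module):
    """Low-rank update (down-projection then up-projection) shared by both encoders."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init so training starts from the pre-trained behavior

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class AdaptedLinear(nn.Module):
    """A frozen pre-trained linear layer plus a (possibly shared) low-rank adapter."""

    def __init__(self, base: nn.Linear, adapter: SharedLoRAAdapter):
        super().__init__()
        self.base, self.adapter = base, adapter
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter parameters are trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.adapter(x)


# Usage sketch: wrap one projection in each encoder with the *same* adapter
# instance, so gradients from both the teacher and the student losses update
# a single set of low-rank parameters (the parameter-sharing strategy).
dim, rank = 768, 8                              # illustrative sizes, not from the paper
teacher_proj = nn.Linear(dim, dim)              # stands in for a teacher block
student_proj = nn.Linear(dim, dim)              # stands in for a student block
shared = SharedLoRAAdapter(dim, rank)
teacher_proj = AdaptedLinear(teacher_proj, shared)
student_proj = AdaptedLinear(student_proj, shared)

x = torch.randn(4, dim)
t_feat, s_feat = teacher_proj(x), student_proj(x)  # both paths use the shared adapter
```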
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19494