AMD: Automatic Multi-step Distillation of Large-Scale Vision Models

Cheng Han, Qifan Wang, Sohail A. Dianat, Majid Rabbani, Raghuveer M. Rao, Yi Fang, Qiang Guan, Lifu Huang, Dongfang Liu

Published: 2024, Last Modified: 14 Jan 2026ECCV (65) 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Transformer-based architectures have become the de-facto standard models for diverse vision tasks owing to their superior performance. As the size of these transformer-based models continues to scale up, model distillation becomes extremely important in real-world deployments, particularly on devices limited by computational resources. However, prevailing knowledge distillation methods exhibit diminished efficacy when confronted with a large capacity gap between the teacher and the student, e.g., 10\(\times \) compression rate. In this paper, we present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. In particular, our distillation process unfolds across multiple steps. Initially, the teacher undergoes distillation to form an intermediate teacher-assistant model, which is subsequently distilled further to the student. An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance. We conduct extensive experiments on multiple image classification datasets, including CIFAR-10, CIFAR-100, and ImageNet. The findings consistently reveal that AMD outperforms several established baselines, paving a path for future knowledge distillation methods on large-scale vision models.