Knowledge Distillation from Few Samples

27 Sept 2018 (modified: 05 May 2023) · ICLR 2019 Conference Blind Submission · Readers: Everyone
Abstract: Current knowledge distillation methods require the full training data to distill knowledge from a large "teacher" network to a compact "student" network by matching certain statistics between "teacher" and "student", such as softmax outputs and feature responses. This is not only time-consuming but also inconsistent with human cognition, in which children can learn knowledge from adults with few examples. This paper proposes a novel and simple method for knowledge distillation from few samples. Under the assumption that "teacher" and "student" have the same feature-map sizes at each corresponding block, we add a $1\times 1$ conv-layer at the end of each block in the student-net, and align the block-level outputs between "teacher" and "student" by estimating the parameters of the added layer from the limited samples. We prove that the added layer can be absorbed/merged into the previous conv-layer to form a new conv-layer with the same number of parameters and computation cost as the previous one. Experiments verify that the proposed method is efficient and effective at distilling knowledge from teacher-nets to student-nets constructed in different ways, on various datasets.
Keywords: knowledge distillation, few-sample learning, network compression
TL;DR: This paper proposes a novel and simple method for knowledge distillation from few samples.
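The abstract describes two steps: estimating the added $1\times 1$ conv-layer from a few samples so the student's block output matches the teacher's, and then absorbing that layer into the preceding conv-layer. Below is a minimal sketch of both steps, not the authors' code; it assumes the alignment is estimated by ordinary least squares and that no non-linearity separates the block's last conv-layer from the added $1\times 1$ layer (neither assumption is spelled out in the abstract). The function names are illustrative.

```python
import torch

def fit_1x1_alignment(student_feat: torch.Tensor,
                      teacher_feat: torch.Tensor) -> torch.Tensor:
    """Estimate a (C_t x C_s) matrix Q such that applying Q pointwise to the
    student's block output approximates the teacher's block output.

    student_feat, teacher_feat: (N, C, H, W) activations from the same few samples.
    """
    n, cs, h, w = student_feat.shape
    ct = teacher_feat.shape[1]
    # Flatten batch and spatial dims: each spatial position is one regression sample.
    S = student_feat.permute(1, 0, 2, 3).reshape(cs, -1)   # (C_s, N*H*W)
    T = teacher_feat.permute(1, 0, 2, 3).reshape(ct, -1)   # (C_t, N*H*W)
    # Least-squares solution of Q S ~= T, solved as S^T Q^T ~= T^T.
    Q = torch.linalg.lstsq(S.T, T.T).solution.T            # (C_t, C_s)
    return Q

def merge_1x1_into_conv(conv_weight: torch.Tensor,
                        conv_bias: torch.Tensor,
                        Q: torch.Tensor):
    """Absorb the estimated 1x1 conv (weight Q) into the preceding conv-layer.

    conv_weight: (C_s, C_in, k, k), conv_bias: (C_s,), Q: (C_t, C_s).
    Valid only if the two layers are composed linearly (no activation between).
    """
    merged_weight = torch.einsum('ts,sikl->tikl', Q, conv_weight)  # (C_t, C_in, k, k)
    merged_bias = Q @ conv_bias  # a bias of the 1x1 layer, if any, would be added here
    return merged_weight, merged_bias
```

Since the abstract assumes teacher and student share feature-map sizes at each block, Q is square (C_t = C_s), so the merged layer keeps exactly the shape, parameter count, and computation cost of the original conv-layer, which is the point of the absorption argument.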