Abstract: Knowledge distillation (KD) is a popular method for reducing the computational
overhead of deep network inference, in which the output of a teacher model is used
to train a smaller, faster student model. Hint training (i.e., FitNets) extends KD by
regressing a student model’s intermediate representation (IR) to a teacher model’s IR.
In this work, we introduce bLock-wise Intermediate representation Training (LIT), a
novel model compression technique that extends the use of IRs in deep network com-
pression, outperforming KD and hint training. LIT has two key ideas: 1) LIT trains a
student of the same width (but shallower depth) as the teacher by directly comparing
the IRs, and 2) LIT uses the IR from the previous block in the teacher model as an
input to the current student block during training, avoiding unstable IRs in the student
network. We show that LIT provides substantial reductions in network depth without
loss in accuracy — for example, LIT can compress a ResNeXt-110 to a ResNeXt-20
(5.5×) on CIFAR10 and a VDCNN-29 to a VDCNN-9 (3.2×) on Amazon Reviews,
outperforming KD and hint training in network size for a given accuracy. Finally, we
show that LIT can effectively compress GAN generators, which are not supported
in the KD framework because GANs output pixels as opposed to probabilities.
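The two key ideas above can be illustrated with a toy sketch. This is a minimal, hypothetical setup (not the paper's implementation): "blocks" are single ReLU layers, a 4-block teacher is compressed into a 2-block student of the same width, and each student block is trained against the teacher IR at the matching depth while receiving the teacher's previous-block IR as its input. The names (`teacher`, `student`, `lit_losses`) and the block structure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 8

def relu(x):
    return np.maximum(x, 0.0)

# Toy "blocks": one weight matrix plus ReLU each (an illustrative stand-in
# for e.g. a ResNeXt block). Teacher has 4 blocks, student has 2 blocks of
# the same width, so student block i stands in for teacher blocks 2i..2i+1.
teacher = [rng.standard_normal((width, width)) * 0.3 for _ in range(4)]
student = [rng.standard_normal((width, width)) * 0.3 for _ in range(2)]

def block_forward(w, x):
    return relu(x @ w)

def lit_losses(x):
    """Per-block LIT losses.

    Idea 1: student and teacher share the same width, so IRs are compared
    directly with an MSE regression.
    Idea 2: each student block is fed the *teacher's* IR from the previous
    block, so training never depends on unstable student IRs upstream.
    """
    # Teacher IRs after the input and after every teacher block.
    t_irs = [x]
    for w in teacher:
        t_irs.append(block_forward(w, t_irs[-1]))

    losses = []
    for i, w in enumerate(student):
        s_in = t_irs[2 * i]           # teacher IR as the student block's input
        s_out = block_forward(w, s_in)
        target = t_irs[2 * (i + 1)]   # teacher IR two teacher-blocks deeper
        losses.append(float(np.mean((s_out - target) ** 2)))
    return losses

x = rng.standard_normal((4, width))
print(lit_losses(x))  # one MSE term per student block
```

Because every student block takes a teacher IR as input, the per-block losses are independent of one another, which is what allows the blocks to be trained in parallel rather than sequentially.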