Abstract: Knowledge Distillation (KD) [6], an effective technique for compressing models and improving their performance, has been widely studied and adopted. However, most previous research focuses on image classification, with little attention to sequence generation (such as Neural Machine Translation). We also note that the few image captioning works that incorporate KD mainly treat it as a training trick. In contrast, in this work we thoroughly investigate KD in the context of image captioning through a series of experiments. Specifically, we first apply standard word-level KD to the image captioning model and explore both cross-model distillation and self-distillation. We find that self-distillation is a practical choice that achieves competitive performance without spending time choosing a teacher architecture. Second, inspired by sequence-level distillation for Neural Machine Translation (NMT) [11], we adapt it to image captioning and observe that competitive performance can be obtained with only one-fifth of the resources, and that inference can be significantly accelerated by eliminating beam search, at the cost of slight performance degradation. Finally, inspired by distilling BERT [19] for NMT, we distill VL-BERT [12] to let the captioning model look ahead by leveraging its bidirectional nature.
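As a rough illustration of the word-level KD objective the abstract refers to, the following is a minimal NumPy sketch: the student is trained to match the teacher's temperature-softened per-word distribution at each decoding step. The function names, the temperature value, and the toy shapes are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over the vocabulary axis.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def word_level_kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) per decoding step, averaged over the sequence.

    student_logits, teacher_logits: (seq_len, vocab_size) arrays of
    pre-softmax scores at each generated word position.
    """
    p_t = softmax(teacher_logits, T)  # soft targets from the teacher
    p_s = softmax(student_logits, T)  # student predictions
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(T * T * kl.mean())   # T^2 rescaling, as in standard KD

# Toy example: a 3-step caption over a 5-word vocabulary.
rng = np.random.default_rng(0)
student = rng.normal(size=(3, 5))
teacher = rng.normal(size=(3, 5))
loss = word_level_kd_loss(student, teacher)
# Matching a teacher with identical logits yields zero loss.
self_loss = word_level_kd_loss(student, student)
```

In practice this term is typically mixed with the usual cross-entropy on ground-truth captions; sequence-level KD instead trains the student directly on the teacher's decoded output sequences.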
External IDs: dblp:conf/cicai/DongHZ21