Abstract: Knowledge distillation has emerged as a promising technique for compressing neural language models. However, most knowledge distillation methods focus on extracting ``knowledge'' from a teacher network to guide the training of a student network, while ignoring the ``requirements'' of the student. In this paper, we introduce Tree Knowledge Distillation for Transformer-based teacher and student models, which allows the student to actively express its ``requirements'' via a tree of tokens. Specifically, we first take the [CLS] token at the output layer of the student Transformer as the root of the tree. The tokens with the highest values in the [CLS] row of the attention feature map at the second-to-last layer are chosen as the children of the root. We then choose the children of these nodes from their corresponding rows of the attention feature map at the next layer, and so on. Next, we connect the layers of the student Transformer to the corresponding teacher layers by skipping every $t$ layers. Finally, we augment the loss function with the summed mean squared error between the student and teacher embeddings of the tokens in the tree. Experiments on the GLUE benchmark show that Tree Knowledge Distillation achieves competitive performance for compressing BERT compared with other knowledge distillation methods.
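To make the procedure concrete, the sketch below shows one possible reading of the abstract: a token tree is grown from student attention maps starting at [CLS], and a summed MSE term is computed between student and teacher hidden states of the selected tokens under a uniform layer-skipping map. This is a minimal illustration, not the paper's implementation; the function names, the branching factor, the head-averaged attention maps, the top-down layer ordering, the `i * skip` layer mapping, and the assumption of equal student/teacher hidden sizes are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def build_token_tree(attn_maps, cls_index=0, branching=3, depth=2):
    """Build a tree of token indices rooted at [CLS].

    attn_maps: list of head-averaged attention maps, each of shape
               [seq_len, seq_len], ordered from the output layer downward
               (assumed ordering for this sketch).
    Returns a list of (layer_offset, token_index) pairs covering the tree.
    """
    selected = [(0, cls_index)]          # root: [CLS] at the output layer
    frontier = [cls_index]
    for level in range(1, depth + 1):
        if level >= len(attn_maps):
            break
        attn = attn_maps[level]          # attention map `level` layers below the top
        next_frontier = []
        for tok in frontier:
            # children = tokens this node attends to most strongly in its row
            children = torch.topk(attn[tok], k=branching).indices.tolist()
            next_frontier.extend(children)
            selected.extend((level, c) for c in children)
        frontier = next_frontier
    return selected

def tree_distillation_loss(student_hidden, teacher_hidden, tree_nodes, skip=2):
    """Summed MSE between student and teacher embeddings of the tree tokens.

    student_hidden / teacher_hidden: lists of [seq_len, hidden] states per layer,
    ordered from the output layer downward. Student layer i is matched to
    teacher layer i * skip (one way to realize the "skip every t layers" mapping).
    Assumes equal hidden sizes; a learned projection would be needed otherwise.
    """
    loss = torch.zeros(())
    for layer_offset, tok in tree_nodes:
        s = student_hidden[layer_offset][tok]
        t = teacher_hidden[layer_offset * skip][tok]
        loss = loss + F.mse_loss(s, t)
    return loss
```

Under these assumptions, the returned tree term would be added to the usual distillation objective (e.g., cross-entropy plus soft-label loss) rather than replacing it.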
Paper Type: long