Causal Distillation for Language Models

Anonymous

08 Mar 2022 (modified: 05 May 2023) · NAACL 2022 Conference Blind Submission · Readers: Everyone
Paper Link: https://openreview.net/forum?id=LdHAEuAKQQ
Paper Type: Short paper (up to four pages of content + unlimited references and appendices)
Abstract: Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the \emph{causal} dynamics of the teacher through a distillation interchange intervention training objective (DIITO). DIITO pushes the student model to become a \emph{causal abstraction} of the teacher model -- a faithful model with simpler causal structure. DIITO is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared against standard distillation under the same settings, DIITO results in lower perplexity on the WikiText-103M corpus (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
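
The sketch below illustrates, in plain PyTorch, how a distillation loss of the kind described in the abstract can combine logit distillation, hidden-state imitation, and a counterfactual (interchange intervention) matching term. It is a minimal illustration, not the authors' implementation: TinyEncoder, distillation_loss, the choice of intervention layers (layer_t, layer_s), equal loss weights, and the omission of the task-specific objective (e.g., masked language modeling) are all assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a transformer encoder: embedding + a stack of hidden layers."""
    def __init__(self, vocab_size=1000, hidden=64, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, ids, intervene_at=None, patched_hidden=None):
        h = self.embed(ids)
        hiddens = []
        for i, layer in enumerate(self.layers):
            h = torch.tanh(layer(h))
            if intervene_at is not None and i == intervene_at:
                # Interchange intervention: overwrite this layer's representation
                # with one computed on a different ("source") input.
                h = patched_hidden
            hiddens.append(h)
        return self.head(h), hiddens


def distillation_loss(teacher, student, base_ids, source_ids, layer_t=2, layer_s=1):
    """Illustrative combination of (1) logit distillation, (2) hidden-state
    imitation, and (3) counterfactual (interchange intervention) matching."""
    with torch.no_grad():
        # Teacher: run the source input, patch its hidden state into the base run.
        _, t_src_hiddens = teacher(source_ids)
        t_cf_logits, _ = teacher(base_ids, intervene_at=layer_t,
                                 patched_hidden=t_src_hiddens[layer_t])
        t_logits, t_hiddens = teacher(base_ids)

    # Student: mirror the same intervention at an (assumed) aligned layer.
    _, s_src_hiddens = student(source_ids)
    s_cf_logits, _ = student(base_ids, intervene_at=layer_s,
                             patched_hidden=s_src_hiddens[layer_s])
    s_logits, s_hiddens = student(base_ids)

    kd = F.kl_div(F.log_softmax(s_logits, -1), F.softmax(t_logits, -1),
                  reduction="batchmean")
    imitate = F.mse_loss(s_hiddens[-1], t_hiddens[-1])
    causal = F.kl_div(F.log_softmax(s_cf_logits, -1), F.softmax(t_cf_logits, -1),
                      reduction="batchmean")
    return kd + imitate + causal


# Hypothetical usage with random token ids; the deeper model plays the teacher.
teacher = TinyEncoder(n_layers=6)
student = TinyEncoder(n_layers=3)
base = torch.randint(0, 1000, (8, 16))
source = torch.randint(0, 1000, (8, 16))
loss = distillation_loss(teacher, student, base, source)
loss.backward()
```

The causal term is what distinguishes this setup from standard distillation: the student is trained to match the teacher's behavior not only on ordinary inputs but also under the same interchange intervention, encouraging the two models to share causal structure.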
Presentation Mode: This paper will be presented virtually
Virtual Presentation Timezone: UTC-8
Copyright Consent Signature (type Name Or NA If Not Transferrable): Zhengxuan Wu
Copyright Consent Name And Address: Stanford University, 450 Serra Mall, Stanford, CA 94305