Cross-Architecture Distillation Using Bidirectional CMOW Embeddings

Published: 28 Jan 2022, Last Modified: 13 Feb 2023, ICLR 2022 Submission
Keywords: natural language processing, word embedding, knowledge distillation, model compression, efficient methods, transfer learning
Abstract: Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches such as pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, CMOW, which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component, per-token representations for distillation during pretraining, and a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as natural language inference. Our results show that the embedding-based models yield scores comparable to DistilBERT on QQP and RTE, while using only half of its parameters and providing three times faster inference. We match or exceed the scores of ELMo, and only fall behind more expensive models on linguistic acceptability. Still, our distilled bidirectional CMOW/CBOW-Hybrid model more than doubles the scores on linguistic acceptability compared to previous cross-architecture distillation approaches. Furthermore, our experiments confirm the positive effects of bidirectionality and the two-sequence encoding scheme.
One-sentence Summary: Large pretrained language models can be effectively distilled into more efficient matrix embedding models.
Supplementary Material: zip
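
For intuition, the encoder described in the abstract can be sketched roughly as follows: each token is embedded as a small matrix, a sequence is encoded by cumulative matrix products taken in both directions (the bidirectional component), and the result is concatenated with an averaged CBOW vector (the Hybrid variant). This is a minimal illustrative sketch, not the authors' implementation; the class name, dimensions, and near-identity initialization are assumptions.

```python
import torch
import torch.nn as nn


class BidirectionalCmowCbowHybrid(nn.Module):
    """Illustrative bidirectional CMOW/CBOW-Hybrid sentence encoder (assumed hyperparameters)."""

    def __init__(self, vocab_size: int, matrix_dim: int = 20, cbow_dim: int = 400):
        super().__init__()
        self.d = matrix_dim
        # CMOW: each token is embedded as a (matrix_dim x matrix_dim) matrix.
        self.matrix_emb = nn.Embedding(vocab_size, matrix_dim * matrix_dim)
        # CBOW: each token additionally gets an additive vector embedding.
        self.vector_emb = nn.Embedding(vocab_size, cbow_dim)
        # Initialize matrix embeddings near the identity so long products
        # neither vanish nor explode (an assumption, common for matrix embeddings).
        with torch.no_grad():
            eye = torch.eye(matrix_dim).flatten()
            self.matrix_emb.weight.copy_(eye + 0.01 * torch.randn_like(self.matrix_emb.weight))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        b, n = token_ids.shape
        mats = self.matrix_emb(token_ids).view(b, n, self.d, self.d)

        # CMOW encoding: cumulative matrix products over the sequence,
        # left-to-right (forward) and right-to-left (backward).
        fwd = torch.eye(self.d, device=token_ids.device).expand(b, self.d, self.d)
        bwd = fwd.clone()
        for i in range(n):
            fwd = fwd @ mats[:, i]
            bwd = bwd @ mats[:, n - 1 - i]

        # CBOW encoding: average of the additive vector embeddings.
        cbow = self.vector_emb(token_ids).mean(dim=1)

        # Hybrid sentence embedding: flattened matrix products concatenated
        # with the CBOW vector.
        return torch.cat([fwd.flatten(1), bwd.flatten(1), cbow], dim=-1)


# Usage: encode a batch of two (padded) token-id sequences.
encoder = BidirectionalCmowCbowHybrid(vocab_size=30000)
ids = torch.randint(0, 30000, (2, 12))
sentence_emb = encoder(ids)  # shape: (2, 20*20 + 20*20 + 400) = (2, 1200)
```

For sentence-pair tasks such as QQP or RTE, the abstract's two-sequence encoding scheme would encode each sequence separately with such an encoder before combining the two representations for the downstream classifier.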