R2D2: Reuse & Reduce via Dynamic Weight Diffusion for Training Efficient NLP ModelsDownload PDF

25 Sep 2019 (modified: 24 Dec 2019)ICLR 2020 Conference Blind SubmissionReaders: Everyone
  • Original Pdf: pdf
  • Keywords: Deep Learning, Natural Language Processing
  • TL;DR: Efficient transform layers inspired by Hamilton Products save parameters
  • Abstract: We propose R2D2 layers, a new neural block for training efficient NLP models. Our proposed method is characterized by a dynamic weight diffusion mechanism which learns to reuse and reduce parameters in the conventional transformation layer, commonly found in popular Transformer/LSTMs models. Our method is inspired by recent Quaternion methods which share parameters via the Hamilton product. This can be interpreted as a neural and learned approximation of the Hamilton product which imbues our method with increased flexibility and expressiveness, i.e., we are no longer restricted by the 4D nature of Quaternion weight sharing. We conduct extensive experiments in the NLP domain, showing that R2D2 (i) enables a parameter savings of up to 2 times to 16 times with minimal degradation of performance and (ii) outperforms other parameter savings alternative such as low-rank factorization and Quaternion methods.
7 Replies