DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling

Sachin Mehta; Rik Koncel-Kedziorski; Mohammad Rastegari; Hannaneh Hajishirzi

Published: 20 Dec 2019, Last Modified: 05 May 2023, ICLR 2020 Conference Blind Submission, Readers: Everyone

Keywords: sequence modeling, input representations, language modeling, word embedding

TL;DR: DeFINE uses a deep, hierarchical, sparse network with new skip connections to learn better word embeddings efficiently.

Abstract: For sequence models with large vocabularies, a majority of network parameters lie in the input and output layers. In this work, we describe a new method, DeFINE, for learning deep token representations efficiently. Our architecture uses a hierarchical structure with novel skip-connections, which allows for the use of low-dimensional input and output layers, reducing total parameters and training time while delivering similar or better performance compared with existing methods. DeFINE can be incorporated easily in new or existing sequence models. Compared to state-of-the-art methods including adaptive input representations, this technique results in a 6% to 20% drop in perplexity. On WikiText-103, DeFINE reduces the total parameters of Transformer-XL by half with minimal impact on performance. On the Penn Treebank, DeFINE improves AWD-LSTM by 4 points with a 17% reduction in parameters, achieving comparable performance to state-of-the-art methods with fewer parameters. For machine translation, DeFINE improves the efficiency of the Transformer model by about 1.4 times while delivering similar performance.
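To make the parameter argument in the abstract concrete, the sketch below compares a standard embedding table with a simple low-rank factorized one. This is only an illustration of why a low-dimensional input layer shrinks the parameter count; it is not the DeFINE architecture itself, which the abstract describes as a deep hierarchical structure with skip-connections. The function names and the vocabulary size, low dimension, and model dimension are all assumed values chosen for the example, not figures from the paper.

```python
# Illustrative parameter counting only. All sizes below are assumptions,
# and the two-matrix factorization is a stand-in, not the DeFINE unit.

def embedding_params(vocab_size: int, model_dim: int) -> int:
    """Parameters in a standard lookup table of shape (vocab_size, model_dim)."""
    return vocab_size * model_dim


def factorized_embedding_params(vocab_size: int, low_dim: int, model_dim: int) -> int:
    """Parameters in a (vocab_size, low_dim) table followed by a
    (low_dim, model_dim) linear projection, biases omitted."""
    return vocab_size * low_dim + low_dim * model_dim


# Hypothetical sizes for a large-vocabulary language model.
vocab_size = 260_000
model_dim = 1024
low_dim = 128

full = embedding_params(vocab_size, model_dim)
factorized = factorized_embedding_params(vocab_size, low_dim, model_dim)

print(f"standard embedding:   {full:,} parameters")
print(f"factorized embedding: {factorized:,} parameters")
print(f"reduction factor:     {full / factorized:.2f}x")
```

With a vocabulary this large, the lookup table dominates the count, so shrinking its width matters far more than the cost of the added projection. A deeper mapping from the low-dimensional table to the model dimension, as the abstract describes, would add further parameters on top of this simple projection.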

Code: [sacmehta/delight](https://github.com/sacmehta/delight)

Data: [WikiText-103](https://paperswithcode.com/dataset/wikitext-103), [WikiText-2](https://paperswithcode.com/dataset/wikitext-2)
