Keywords: language model, LSTM, regularization, optimization, ASGD, dropconnect
TL;DR: Effective regularization and optimization strategies for LSTM-based language models achieve state-of-the-art results on Penn Treebank (PTB) and WikiText-2 (WT2).
Abstract: In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights, as a form of recurrent regularization. Further, we introduce NT-ASGD, a non-monotonically triggered (NT) variant of the averaged stochastic gradient method (ASGD), wherein the averaging trigger is determined using an NT condition as opposed to being tuned by the user. Using these and other regularization strategies, our ASGD Weight-Dropped LSTM (AWD-LSTM) achieves state-of-the-art word-level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2. We also explore the viability of the proposed regularization and optimization strategies in the context of the quasi-recurrent neural network (QRNN) and demonstrate comparable performance to the AWD-LSTM counterpart. The code for reproducing the results is open source and available at https://github.com/salesforce/awd-lstm-lm.
Code: [salesforce/awd-lstm-lm](https://github.com/salesforce/awd-lstm-lm) + [44 community implementations (Papers with Code)](https://paperswithcode.com/paper/?openreview=SyyGPP0TZ)
Data: [Penn Treebank](https://paperswithcode.com/dataset/penn-treebank), [WikiText-2](https://paperswithcode.com/dataset/wikitext-2)
Community Implementations: [41 code implementations (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:1708.02182/code)
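
Illustration: The two ideas named in the abstract can be sketched in a few lines of plain Python. This is only a minimal sketch under stated assumptions, not the released implementation. The abstract does not spell out the NT condition or the DropConnect details, so the sketch assumes (1) that averaging is triggered once the current validation perplexity is worse than the best value recorded more than `patience` checks earlier, and (2) that DropConnect zeroes individual hidden-to-hidden weights with probability `drop_prob` and rescales the survivors by `1 / (1 - drop_prob)`. All function and parameter names here are hypothetical.

```python
import random


def dropconnect_mask(rows, cols, drop_prob, rng):
    """Sample one mask for a rows x cols hidden-to-hidden weight matrix.

    Each entry is 0.0 with probability drop_prob; surviving entries carry
    the factor 1 / (1 - drop_prob) (assumed inverted-dropout rescaling).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [
        [0.0 if rng.random() < drop_prob else keep_scale for _ in range(cols)]
        for _ in range(rows)
    ]


def apply_mask(weights, mask):
    """Element-wise product of a weight matrix and a mask (lists of lists)."""
    return [
        [w * m for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, mask)
    ]


def should_start_averaging(val_history, current_val, patience):
    """Assumed non-monotone trigger for switching from SGD to averaging.

    val_history holds earlier validation perplexities (oldest first) and
    excludes current_val. Returns True when current_val is worse than the
    best value seen more than `patience` checks ago.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    if len(val_history) <= patience:
        return False  # not enough history to judge yet
    return current_val > min(val_history[:-patience])


if __name__ == "__main__":
    rng = random.Random(0)
    w_hh = [[0.5, -0.25], [1.0, 2.0]]  # toy hidden-to-hidden weights
    mask = dropconnect_mask(2, 2, drop_prob=0.5, rng=rng)
    print(apply_mask(w_hh, mask))
    print(should_start_averaging([100, 90, 85, 86, 87], 88, patience=2))
```

Because the mask applies to the weights rather than to the activations, it can be sampled once per forward pass and reused at every timestep of the sequence. The trigger function replaces a hand-tuned "start averaging at step T" setting with a check on the validation history.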