Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum

Omer Levy, Kenton Lee, Nicholas FitzGerald, Luke Zettlemoyer

Feb 15, 2018 (modified: Feb 15, 2018) ICLR 2018 Conference Blind Submission readers: everyone Show Bibtex
  • Abstract: Long short-term memory networks (LSTMs) were introduced to combat vanishing gradients in simple recurrent neural networks (S-RNNs) by augmenting them with additive recurrent connections controlled by gates. We present an alternate view to explain the success of LSTMs: the gates themselves are powerful recurrent models that provide more representational power than previously appreciated. We do this by showing that the LSTM's gates can be decoupled from the embedded S-RNN, producing a restricted class of RNNs where the main recurrence computes an element-wise weighted sum of context-independent functions of the inputs. Experiments on a range of challenging NLP problems demonstrate that the simplified gate-based models work substantially better than S-RNNs, and often just as well as the original LSTMs, strongly suggesting that the gates are doing much more in practice than just alleviating vanishing gradients.
  • TL;DR: Gates do all the heavy lifting in LSTMs by computing element-wise weighted sums, and removing the internal simple RNN does not degrade model performance.