Frustratingly Short Attention Spans in Neural Language Modeling

Michał Daniluk, Tim Rocktäschel, Johannes Welbl, Sebastian Riedel

Nov 04, 2016 (modified: Feb 19, 2017). ICLR 2017 conference submission.
  • Abstract: Current language modeling architectures often use recurrent neural networks. Recently, various methods for incorporating differentiable memory into these architectures have been proposed. When predicting the next token, these models query information from a memory of the recent history and can thus facilitate learning mid- and long-range dependencies. However, conventional attention models produce a single output vector per time step that is used both for predicting the next token and as the key and value of a differentiable memory of the token history. In this paper, we propose a key-value attention mechanism that produces separate representations for the key and value of a memory, and for a representation that encodes the next-word distribution. This use of past memories outperforms existing memory-augmented neural language models on two corpora. Yet we found that it mainly uses a memory of only the previous five representations. This led to the unexpected main finding that a much simpler model, which simply uses a concatenation of output representations from the previous three time steps, is on par with more sophisticated memory-augmented neural language models.
  • TL;DR: We investigate various memory-augmented neural language models and compare them against state-of-the-art architectures.
  • Keywords: Natural language processing, Deep learning
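
The two ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no equations, so the sketch assumes that each output vector is split into three equal parts (key, value, predict), that scoring is a plain dot product followed by a softmax, and that the simpler model concatenates whole output vectors. The function names (split_output, key_value_attention, concat_previous) are hypothetical and chosen only for illustration.

```python
import math


def split_output(h):
    """Split one output vector into (key, value, predict) thirds (assumed layout)."""
    d = len(h) // 3
    return h[:d], h[d:2 * d], h[2 * d:3 * d]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def key_value_attention(outputs, window=5):
    """Attend from the latest output over the `window` outputs before it.

    Returns the attention weights and a feature vector made of the
    attended context followed by the predict part of the latest output.
    Dot-product scoring is an assumption of this sketch.
    """
    *history, current = outputs
    history = history[-window:]
    query, _, predict = split_output(current)

    keys, values = [], []
    for h in history:
        k, v, _ = split_output(h)
        keys.append(k)
        values.append(v)

    # Keys are used only for scoring; values are used only for the context.
    scores = [sum(a * b for a, b in zip(k, query)) for k in keys]
    weights = softmax(scores)
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, context + predict


def concat_previous(outputs, n=3):
    """Simpler alternative: concatenate the last `n` output vectors."""
    return [x for h in outputs[-n:] for x in h]


# Toy data: eight time steps, each with a 6-dimensional output vector.
outputs = [[float(t)] * 6 for t in range(8)]
weights, features = key_value_attention(outputs, window=5)
ngram_features = concat_previous(outputs, n=3)
```

In this sketch, either `features` or `ngram_features` would be passed to the layer that produces the next-word distribution. Separating the key, value, and predict parts means that one vector no longer has to serve all three purposes at once.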