- TL;DR: We analyze the gradient propagation in deep RNNs and from our analysis, we propose a new multi-layer deep RNN.
- Abstract: Recurrent Neural Networks (RNNs) are widely used models for sequence data. Just like for feedforward networks, it has become common to build "deep" RNNs, i.e., stack multiple recurrent layers to obtain higher-level abstractions of the data. However, this works only for a handful of layers. Unlike feedforward networks, stacking more than a few recurrent units (e.g., LSTM cells) usually hurts model performance, the reason being vanishing or exploding gradients during training. We investigate the training of multi-layer RNNs and examine the magnitude of the gradients as they propagate through the network. We show that, depending on the structure of the basic recurrent unit, the gradients are systematically attenuated or amplified, so that with an increasing depth they tend to vanish, respectively explode. Based on our analysis we design a new type of gated cell that better preserves gradient magnitude, and therefore makes it possible to train deeper RNNs. We experimentally validate our design with five different sequence modelling tasks on three different datasets. The proposed stackable recurrent (STAR) cell allows for substantially deeper recurrent architectures, with improved performance.
- Keywords: Deep RNN, Multi-layer RNN