- Abstract: This paper introduces a framework for mapping Recurrent Neural Network (RNN) architectures efficiently onto parallel processors such as GPUs. Key to our approach is the use of persistent computational kernels that exploit the processor’s memory hierarchy to reuse network weights over multiple timesteps. Using our framework, we show how it is possible to achieve substantially higher computational throughput at lower mini-batch sizes than direct implementations of RNNs based on matrix multiplications. Our initial implementation achieves 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU, which is about 45% of theoretical peak throughput, and is 30x faster than a standard RNN implementation based on optimized GEMM kernels at this batch size. Reducing the batch size from 64 to 4 per processor provides a 16x reduction in activation memory footprint, enables strong scaling to 16x more GPUs using data-parallelism, and allows us to efficiently explore end-to-end speech recognition models with up to 108 residual RNN layers.
- Conflicts: baidu.com
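
The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only and not part of the paper: the function names are hypothetical, and it assumes that activation memory grows linearly with the per-processor mini-batch size and that the global batch size stays fixed when adding GPUs.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Assumptions (not stated in the source): activation memory is linear in
# the per-processor mini-batch size, and the global batch size is fixed.


def batch_reduction_factor(old_batch: int, new_batch: int) -> float:
    """Ratio of old to new per-processor mini-batch size."""
    return old_batch / new_batch


def implied_peak_tflops(achieved_tflops: float, fraction_of_peak: float) -> float:
    """Theoretical peak implied by an achieved rate and its share of peak."""
    return achieved_tflops / fraction_of_peak


def implied_baseline_tflops(achieved_tflops: float, speedup: float) -> float:
    """Throughput of the baseline implied by a reported speedup."""
    return achieved_tflops / speedup


# 64 -> 4 samples per processor: 16x less activation memory per processor,
# and 16x more data-parallel workers for the same global batch.
reduction = batch_reduction_factor(64, 4)

# 2.8 TFLOP/s at about 45% of peak implies a peak near 6.2 TFLOP/s.
peak = implied_peak_tflops(2.8, 0.45)

# A 30x speedup implies the GEMM-based baseline runs near 0.09 TFLOP/s
# at a mini-batch size of 4.
baseline = implied_baseline_tflops(2.8, 30.0)

print(f"reduction factor: {reduction:.0f}x")
print(f"implied peak: {peak:.2f} TFLOP/s")
print(f"implied GEMM baseline: {baseline:.3f} TFLOP/s")
```

Because "about 45%" is rounded in the abstract, the implied peak and baseline values are approximate.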