Training Long Short-Term Memory With Sparsified Stochastic Gradient Descent

Maohua Zhu; Minsoo Rhu; Jason Clemons; Stephen W. Keckler; Yuan Xie

12 Jul 2025 (modified: 21 Jul 2022) Submitted to ICLR 2017 Readers: Everyone

Abstract: Prior work has demonstrated that exploiting sparsity can dramatically improve the energy efficiency and shrink the memory footprint of Convolutional Neural Networks (CNNs). However, these sparsity-centric optimization techniques might be less effective for Long Short-Term Memory (LSTM) based Recurrent Neural Networks (RNNs), especially during the training phase, because of the significant structural difference between the neurons. To investigate whether sparsity-centric optimization is possible for training LSTM-based RNNs, we studied several applications and observed that there is potential sparsity in the gradients generated during backward propagation. In this paper, we illustrate why this sparsity exists and propose a simple yet effective thresholding technique to induce further sparsity when training an LSTM-based RNN. Experimental results show that the proposed technique can increase the sparsity of linear gate gradients to higher than 80% without loss of performance, which makes more than 50% of the multiply-accumulate (MAC) operations redundant in an entire LSTM training process. These redundant MAC operations can be eliminated by hardware techniques to improve the energy efficiency and training speed of LSTM-based RNNs.
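
As a rough illustration of the thresholding idea described in the abstract, the sketch below zeroes gradient entries whose magnitude falls below a cutoff and measures the resulting sparsity. The abstract does not give the actual threshold rule or specify how it is applied to the gate gradients, so the function names, the fixed cutoff, and the random stand-in gradient are all assumptions made for illustration only.

```python
import random


def threshold_gradient(grad, threshold):
    """Zero every entry whose magnitude is below `threshold`.

    Hypothetical helper: `grad` is a list of rows of floats standing in
    for a gate-gradient matrix; the fixed cutoff is an assumption.
    """
    return [[0.0 if abs(g) < threshold else g for g in row] for row in grad]


def sparsity(grad):
    """Return the fraction of entries that are exactly zero."""
    total = sum(len(row) for row in grad)
    zeros = sum(1 for row in grad for g in row if g == 0.0)
    return zeros / total


if __name__ == "__main__":
    random.seed(0)
    # Stand-in gradient values; not data from the paper.
    gate_grad = [[random.gauss(0.0, 1e-3) for _ in range(8)] for _ in range(4)]
    sparse_grad = threshold_gradient(gate_grad, threshold=1e-3)
    print("sparsity before:", sparsity(gate_grad))
    print("sparsity after: ", sparsity(sparse_grad))
```

Any multiply-accumulate whose gradient operand is exactly zero contributes nothing to the result, which is why a higher fraction of zeros translates into skippable work.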

TL;DR: A simple yet effective technique to induce a considerable amount of sparsity in LSTM training

Conflicts: nvidia.com, ucsb.edu

Keywords: Optimization, Deep learning
