- Abstract: Long Short-Term Memory (LSTM) is widely used to solve sequence modeling problems such as image captioning. We find that LSTM cells are heavily redundant. We adopt network pruning to reduce this redundancy and introduce sparsity as a new form of regularization that reduces overfitting. We achieve better performance than the dense baseline while reducing the total number of parameters in the LSTM by more than 80%, from 2.1 million to only 0.4 million. Sparse LSTM improves the BLEU-4 score by 1.3 points on the Flickr8k dataset and the CIDEr score by 1.7 points on the MSCOCO dataset. We explore four types of pruning policies for LSTM, visualize the sparsity pattern and weight distribution of the sparse LSTM, and analyze the pros and cons of each policy.
- TL;DR: We achieve better performance with 80% fewer parameters by introducing sparsity to LSTM
- Keywords: Deep learning
- Conflicts: nvidia.com, stanford.edu, tsinghua.edu.cn