Finding Deep Local Optima Using Network PruningDownload PDF

25 Sep 2019 (modified: 24 Dec 2019)ICLR 2020 Conference Blind SubmissionReaders: Everyone
  • Original Pdf: pdf
  • Keywords: network pruning, non-convex optimization
  • Abstract: Artificial neural networks (ANNs) are very popular nowadays and offer reliable solutions to many classification problems. However, training deep neural networks (DNN) is time-consuming due to the large number of parameters. Recent research indicates that these DNNs might be over-parameterized and different solutions have been proposed to reduce the complexity both in the number of parameters and in the training time of the neural networks. Furthermore, some researchers argue that after reducing the neural network complexity via connection pruning, the remaining weights are irrelevant and retraining the sub-network would obtain a comparable accuracy with the original one. This may hold true in most vision problems where we always enjoy a large number of training samples and research indicates that most local optima of the convolutional neural networks may be equivalent. However, in non-vision sparse datasets, especially with many irrelevant features where a standard neural network would overfit, this might not be the case and there might be many non-equivalent local optima. This paper presents empirical evidence for these statements and an empirical study of the learnability of neural networks (NNs) on some challenging non-linear real and simulated data with irrelevant variables. Our simulation experiments indicate that the cross-entropy loss function on XOR-like data has many local optima, and the number of local optima grows exponentially with the number of irrelevant variables. We also introduce a connection pruning method to improve the capability of NNs to find a deep local minimum even when there are irrelevant variables. Furthermore, the performance of the discovered sparse sub-network degrades considerably either by retraining from scratch or the corresponding original initialization, due to the existence of many bad optima around. Finally, we will show that the performance of neural networks for real-world experiments on sparse datasets can be recovered or even improved by discovering a good sub-network architecture via connection pruning.
4 Replies