Learning Sparse Neural Networks through L_0 Regularization


Nov 03, 2017 (modified: Nov 03, 2017) ICLR 2018 Conference Blind Submission
  • Abstract: In order to learn the model structure, and to improve computational efficiency and generalization, we are often interested in learning the parameters of neural networks while strongly encouraging (blocks of) weights to take the value of exactly zero. The most general way to enforce this is to adopt an $L_0$ norm as a penalty on the weights. However, since the $L_0$ norm is non-differentiable, we cannot incorporate it directly as a regularization term in the objective function. We propose a solution through the inclusion of a collection of non-negative stochastic gates, which collectively determine which weights to set to zero. We show that, somewhat surprisingly, the expected $L_0$ norm of the resulting gated weights is differentiable with respect to the distribution parameters for certain well-chosen probability distributions over the gates. To this end, we employ a novel distribution over the gates, which we name the \emph{hard concrete}; it is obtained by ``stretching'' a binary concrete distribution and then transforming its samples with a hard-sigmoid. The parameters of the gate distribution can then be jointly optimized with the original network parameters. As a result, our method allows for straightforward and efficient learning of model structures with stochastic gradient descent and enables conditional computation in a principled way. We perform various experiments to demonstrate the effectiveness of the resulting approach and regularizer.
  • TL;DR: We show how to optimize the expected $L_0$ norm of parametric models with gradient descent and introduce a new distribution that facilitates hard gating.
  • Keywords: Sparsity, compression, hard and soft attention.
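The hard concrete construction described in the abstract (binary concrete sample, stretched past $[0, 1]$, then clipped with a hard-sigmoid) can be sketched numerically. This is a minimal illustration, not the submission's implementation: the stretch limits `GAMMA`/`ZETA`, the temperature `BETA`, and the per-gate location parameters `log_alpha` are assumed hyperparameter names chosen for clarity.

```python
import numpy as np

# Assumed hyperparameters for illustration: stretch interval (GAMMA, ZETA)
# with GAMMA < 0 < 1 < ZETA, and a fixed concrete temperature BETA.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hard_concrete(log_alpha, rng):
    """Draw gates z in [0, 1]: binary concrete -> stretch -> hard-sigmoid."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(log_alpha))
    # Reparameterized binary concrete sample in (0, 1)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)
    # Stretch to (GAMMA, ZETA), then clip so mass piles up at exactly 0 and 1
    s_stretched = s * (ZETA - GAMMA) + GAMMA
    return np.clip(s_stretched, 0.0, 1.0)

def expected_l0(log_alpha):
    """Differentiable surrogate for the L_0 penalty: for each gate, the
    probability that the stretched sample lands above zero."""
    return np.sum(sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA)))

rng = np.random.default_rng(0)
log_alpha = np.array([-4.0, 0.0, 4.0])  # one location parameter per weight
z = sample_hard_concrete(log_alpha, rng)  # multiply these into the weights
penalty = expected_l0(log_alpha)          # add to the training loss
```

Because both the sample (via the uniform reparameterization) and the penalty are smooth in `log_alpha`, the gate parameters can be trained jointly with the network weights by stochastic gradient descent, and gates driven to near-certain zero yield exactly sparse weights at test time.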