Asymmetrical Scaling Layers for Stable Network Pruning

31 Aug 2020 · OpenReview Archive Direct Upload
Abstract: We propose a new training setup, called ScaLa, and a new pruning algorithm built on it, called ScaLP. The training setup ScaLa is designed to make standard Stochastic Gradient Descent resilient to changes in layer width. It consists of adding a fixed, well-chosen scaling layer before each linear or convolutional layer. The resulting learning behavior is largely independent of layer widths, in particular with respect to optimal learning rates, which stay close to 1. Beyond the usual choice of scaling each input by the factor 1/fan-in, we also propose a family of asymmetric scaling factors, which promotes learning some neurons faster than others. The pruning algorithm ScaLP combines ScaLa with asymmetric scaling and adds weight penalties. With ScaLP, the final pruned architecture is roughly independent of the layer widths of the initial network.
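The following is a minimal PyTorch sketch of the idea described in the abstract: a fixed (non-trainable) scaling applied to the inputs of a linear layer, with either the uniform 1/fan-in factor or a per-unit asymmetric factor. The module name `ScaledLinear`, the geometric decay schedule, and the learning-rate choice in the usage line are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn


class ScaledLinear(nn.Module):
    """Linear layer preceded by a fixed scaling layer (hypothetical sketch).

    Each input coordinate is multiplied by a fixed, non-trainable factor
    before the usual nn.Linear. The symmetric choice uses the uniform factor
    1/fan_in; the asymmetric choice assigns a different (here, geometrically
    decaying) factor to each input unit, so some upstream neurons receive
    larger effective gradients and therefore learn faster.
    """

    def __init__(self, in_features, out_features, asymmetric=False, decay=0.9):
        super().__init__()
        if asymmetric:
            # Illustrative geometric schedule; the paper's exact family of
            # asymmetric factors may differ.
            scale = decay ** torch.arange(in_features, dtype=torch.float32)
            scale = scale / in_features
        else:
            # Symmetric choice from the abstract: scale every input by 1/fan-in.
            scale = torch.full((in_features,), 1.0 / in_features)
        self.register_buffer("scale", scale)  # fixed, excluded from optimization
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.linear(x * self.scale)


# Usage sketch: per the abstract, with the fixed scaling in place a learning
# rate close to 1 is reported to remain near-optimal across layer widths;
# the weight decay here stands in for the weight penalties used by ScaLP.
layer = ScaledLinear(512, 128, asymmetric=True)
opt = torch.optim.SGD(layer.parameters(), lr=1.0, weight_decay=1e-4)
```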