Dynamic Pruning of a Neural Network via Gradient Signal-to-Noise Ratio

Julien Niklas Siems; Aaron Klein; Cedric Archambeau; Maren Mahsereci

Published: 14 Jul 2021, Last Modified: 05 May 2023, AutoML@ICML2021 Poster, Readers: Everyone

Keywords: Pruning, Neural Network, Gradient statistics

TL;DR: We propose a criterion for pruning weights in a neural network during training, which measures the noise in a mini-batch gradient.

Abstract: While training highly overparameterized neural networks is common practice in deep learning, research into post-hoc weight pruning suggests that more than 90% of parameters can be removed without loss in predictive performance. To save resources, zero-shot and one-shot pruning attempt to find such a sparse representation at initialization or at an early stage of training. Though these approaches are efficient, there is no justification for why the sparsity structure should not change during training. Dynamic sparsity pruning lifts this limitation and allows the structure of the sparse neural network to adapt during training. Recent approaches rely on weight magnitude pruning, which has been shown to be sub-optimal when applied at earlier training stages. In this work, we propose to use the gradient noise to make pruning decisions. The procedure enables us to automatically adjust the sparsity during training without imposing a hand-designed sparsity schedule, while at the same time being able to recover from previous pruning decisions by unpruning connections as necessary. We evaluate our new method on image and tabular datasets and demonstrate that we reach performance similar to that of the dense model from which we extract the sparse network, while exposing fewer hyperparameters than other dynamic sparsity methods.
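
As a rough illustration of what a gradient signal-to-noise criterion could look like, the following minimal sketch computes a per-weight ratio from the per-example gradients of one mini-batch and derives a keep/prune mask from it. The abstract does not state the exact statistic or decision rule, so everything here is an assumption made for illustration: the function names `gradient_snr` and `snr_mask`, the definition of the ratio as squared mean gradient over the estimated variance of that mean, the `threshold` parameter, and the choice to mask weights whose ratio falls below it.

```python
import numpy as np


def gradient_snr(per_example_grads, eps=1e-12):
    """Per-weight signal-to-noise ratio of a mini-batch gradient.

    per_example_grads has shape (batch_size, num_weights).
    Assumed definition (not taken from the paper): squared mean gradient
    divided by the estimated variance of that mean.
    """
    batch_size = per_example_grads.shape[0]
    mean = per_example_grads.mean(axis=0)
    # Unbiased variance of the individual gradients, scaled to the
    # variance of their mini-batch mean.
    var_of_mean = per_example_grads.var(axis=0, ddof=1) / batch_size
    return mean**2 / (var_of_mean + eps)


def snr_mask(per_example_grads, threshold=1.0):
    """Boolean mask: True keeps a weight, False prunes it.

    The threshold and the direction of the comparison are hypothetical.
    """
    return gradient_snr(per_example_grads) >= threshold


# Toy mini-batch of 4 examples and 2 weights.
# Weight 0 receives a consistent gradient; weight 1 receives pure noise.
grads = np.array([[1.0, 1.0],
                  [1.0, -1.0],
                  [1.0, 1.0],
                  [1.0, -1.0]])
mask = snr_mask(grads)

weights = np.array([0.5, -0.3])
sparse_weights = weights * mask  # masked weights are zeroed, not deleted
```

In this sketch the mask is recomputed from fresh gradient statistics at every call, so a weight that was masked earlier can be kept again later, provided gradients are still computed for masked weights. That mirrors the unpruning behaviour described in the abstract only loosely and is not a reproduction of the authors' procedure.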

Ethics Statement: We see opportunities for our work to have a positive impact by contributing to potential energy savings in training neural networks and by making them more memory efficient, hence reducing their carbon footprint and financial cost. However, our work may also lead to negative side effects. For example, while our method produces sparse networks, the resulting classifiers may not necessarily be fair and could give dominant groups in a population even greater weight than non-sparse networks do.
