How much pre-training is enough to discover a good subnetwork?

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: lottery ticket hypothesis, pruning, greedy selection
Abstract: Neural network pruning is useful for discovering efficient, high-performing subnetworks within pre-trained, dense network architectures. More often than not, however, pruning involves a three-step process (pre-training, pruning, and re-training) that is computationally expensive because the dense model must be fully pre-trained. Fortunately, several works have empirically shown that high-performing subnetworks can be discovered via pruning without fully pre-training the dense network. To theoretically analyze the amount of dense-network pre-training needed for a pruned network to perform well, we derive a bound on the number of SGD pre-training iterations for a two-layer, fully-connected network, beyond which pruning via greedy forward selection (Ye et al., 2020) yields a subnetwork that achieves good training error. This threshold depends logarithmically on the size of the dataset, meaning that larger datasets require more pre-training before subnetworks obtained via pruning perform well. We empirically validate our theoretical results across a variety of architectures and datasets, including fully-connected networks trained on MNIST and several deep convolutional neural network (CNN) architectures trained on CIFAR10 and ImageNet.
One-sentence Summary: We provide a theoretical bound on the number of SGD pre-training iterations for a two-layer network, beyond which subnetworks pruned from this dense model perform well.
Supplementary Material: zip
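
For context, below is a minimal sketch of greedy forward selection pruning in the spirit of Ye et al. (2020), applied to a partially pre-trained two-layer ReLU network: the subnetwork output is taken as the average over the selected hidden neurons (one common formulation), and each step adds the neuron that most reduces training loss. All variable names, shapes, and the use of squared loss are illustrative assumptions, not the paper's exact implementation.

# Hedged sketch of greedy forward selection pruning (cf. Ye et al., 2020).
# Assumes a two-layer ReLU network and squared loss; details are illustrative.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def subnet_predictions(X, W, a, selected):
    # Average the outputs of the selected hidden neurons (one common formulation).
    idx = list(selected)
    H = relu(X @ W[idx].T)           # (n, |S|) hidden activations
    return H @ a[idx] / len(idx)     # (n,) averaged subnetwork output

def greedy_forward_selection(X, y, W, a, budget):
    # Greedily add the neuron that most reduces training MSE until `budget` neurons are selected.
    selected, remaining = [], set(range(W.shape[0]))
    for _ in range(budget):
        best_i, best_loss = None, np.inf
        for i in remaining:
            preds = subnet_predictions(X, W, a, selected + [i])
            loss = np.mean((preds - y) ** 2)
            if loss < best_loss:
                best_i, best_loss = i, loss
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

# Toy usage: a dense two-layer net with m hidden neurons after some SGD pre-training.
rng = np.random.default_rng(0)
n, d, m = 200, 10, 64
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
W = rng.normal(size=(m, d))   # hidden-layer weights (stand-in for pre-trained values)
a = rng.normal(size=m)        # output-layer weights (stand-in for pre-trained values)
print(greedy_forward_selection(X, y, W, a, budget=8))

In this setting, the paper's question is how many SGD pre-training iterations the dense weights W and a need before the subnetwork returned by such a greedy procedure reaches good training error.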