Abstract: The Lottery Ticket Hypothesis states that dense networks contain sparse subnetworks (called 'winning' lottery tickets) that, when trained under the same regime, achieve validation accuracy similar to or better than that of the dense network. It has been shown that for larger networks and more complex datasets, an additional pretraining step is required for winning lottery tickets to be found. Previous work linked the amount of pretraining required to a measure of instability to SGD noise. In this paper, we take a closer look at the training hyperparameters that influence SGD instability during normal training and link them to the ability to find 'winning' tickets. We show that several techniques that benefit dense-network generalization increase SGD instability and thus hinder the extraction of 'winning' tickets. By dampening this instability through careful hyperparameter selection, we show that 'winning' tickets can be extracted without pretraining, and that they even outperform tickets found with pretraining at more extreme sparsities. We further discover that tickets found under lower instability to SGD noise have the unexpected side effect of encoding powerful classification features in their untrained weights. We show that these features do not emerge under more unstable hyperparameter settings, that they transfer to different datasets, and that they enable faster training of the resulting tickets.
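The abstract describes extracting sparse 'winning' tickets from a trained dense network. The standard procedure for this is iterative magnitude pruning with rewinding to the initial weights; the toy model, data, and pruning schedule below are illustrative assumptions, not details from this paper:

```python
import numpy as np

# Minimal sketch of iterative magnitude pruning (IMP) on a toy logistic
# regression. Model, data, and hyperparameters are illustrative only.
rng = np.random.default_rng(0)

# Toy linearly separable classification data.
X = rng.normal(size=(200, 20))
w_true = rng.normal(size=20)
y = (X @ w_true > 0).astype(float)

def train(w_init, mask, steps=200, lr=0.1):
    """Train logistic regression under a fixed sparsity mask."""
    w = w_init * mask
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y)
        w -= lr * grad * mask  # pruned weights stay exactly zero
    return w

w_init = rng.normal(size=20)     # the initialization we rewind to
mask = np.ones(20)
for _ in range(3):               # three pruning rounds (assumed schedule)
    w = train(w_init, mask)
    alive = np.flatnonzero(mask)
    k = max(1, int(0.2 * len(alive)))            # prune 20% of survivors
    prune = alive[np.argsort(np.abs(w[alive]))[:k]]
    mask[prune] = 0.0
    # 'rewind': surviving weights restart from w_init, not from w

w_ticket = train(w_init, mask)   # retrain the ticket from initialization
acc = ((1.0 / (1.0 + np.exp(-(X @ w_ticket))) > 0.5) == y).mean()
print(f"sparsity: {1 - mask.mean():.2f}, accuracy: {acc:.2f}")
```

The key design choice, following the original lottery-ticket procedure, is that after each pruning round the surviving weights are reset to their initial values rather than kept at their trained values, so the final ticket is trained from scratch under its mask.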
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yani_Ioannou1
Submission Number: 4072