Abstract: The Lottery Ticket Hypothesis states that dense networks contain sparse subnetworks (called 'winning' lottery tickets) that, when trained under the same regime, achieve validation accuracy similar to or better than that of the dense network. It has been shown that for larger networks and more complex datasets, an additional pretraining step is required for winning lottery tickets to be found. Previous work linked the amount of pretraining required to a measure of instability to SGD noise. In this paper, we take a closer look at the training hyperparameters that influence SGD instability during normal training and link them to the ability to find 'winning' tickets. We show that several techniques that benefit dense-network generalization increase SGD instability and thus hinder the extraction of 'winning' tickets. By dampening this instability through careful hyperparameter selection, we show that 'winning' tickets can be extracted without pretraining, and that they even outperform tickets found with pretraining at more extreme sparsities. We further discover that tickets found under lower instability to SGD noise have the unexpected side effect of encoding powerful classification features in their untrained weights. We show that these features do not emerge under more unstable hyperparameter settings, that they transfer to different datasets, and that they enable faster training of the resulting tickets.
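The abstract describes extracting sparse 'winning' tickets from a trained dense network. The standard procedure for this is iterative magnitude pruning with rewinding to the initial weights; the toy model, data, and pruning schedule below are illustrative assumptions, not details from this paper:

```python
import numpy as np

# Minimal sketch of iterative magnitude pruning (IMP) on a toy logistic
# regression. Model, data, and hyperparameters are illustrative only.
rng = np.random.default_rng(0)

# Toy linearly separable classification data.
X = rng.normal(size=(200, 20))
w_true = rng.normal(size=20)
y = (X @ w_true > 0).astype(float)

def train(w_init, mask, steps=200, lr=0.1):
    """Train logistic regression under a fixed sparsity mask."""
    w = w_init * mask
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y)
        w -= lr * grad * mask  # pruned weights stay exactly zero
    return w

w_init = rng.normal(size=20)     # the initialization we rewind to
mask = np.ones(20)
for _ in range(3):               # three pruning rounds (assumed schedule)
    w = train(w_init, mask)
    alive = np.flatnonzero(mask)
    k = max(1, int(0.2 * len(alive)))            # prune 20% of survivors
    prune = alive[np.argsort(np.abs(w[alive]))[:k]]
    mask[prune] = 0.0
    # 'rewind': surviving weights restart from w_init, not from w

w_ticket = train(w_init, mask)   # retrain the ticket from initialization
acc = ((1.0 / (1.0 + np.exp(-(X @ w_ticket))) > 0.5) == y).mean()
print(f"sparsity: {1 - mask.mean():.2f}, accuracy: {acc:.2f}")
```

The key design choice, following the original lottery-ticket procedure, is that after each pruning round the surviving weights are reset to their initial values rather than kept at their trained values, so the final ticket is trained from scratch under its mask.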
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yani_Ioannou1
Submission Number: 4072