Pruning Close to Home: Distance from Initialization impacts Lottery Tickets

TMLR Paper7003 Authors

13 Jan 2026 (modified: 23 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: The Lottery Ticket Hypothesis (LTH) states that there exist sparse subnetworks (called 'winning' Lottery Tickets) within dense, randomly initialized networks that, when trained under the same regime, achieve validation accuracy similar to or better than that of the dense network. It has been shown that for larger networks and more complex datasets, these Lottery Tickets cannot be found at random initialization, but instead require lightly pretrained weights. More specifically, the pretrained weights need to be stable to SGD noise, but computing this metric is an expensive procedure. In this paper, we take a closer look at certain training hyperparameters that influence SGD noise throughout optimization. We show that with careful hyperparameter selection we can forego the pretraining step and still find winning tickets in various settings. We term these hyperparameters early-stable, as networks trained with them become stable to SGD noise early in training, and we discover that the tickets they produce exhibit remarkable generalization properties. Finally, we hypothesize that a larger Learning Distance negatively impacts the generalization of the sparse network produced by iterative pruning, and devise an experiment to show this.
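The Learning Distance mentioned above can be illustrated with a minimal sketch. This is an assumption for illustration only: here it is taken to be the L2 distance between a network's trained weights and its initial weights, aggregated over all parameter tensors (the paper's exact definition may differ).

```python
import numpy as np

def learning_distance(init_params, trained_params):
    """Illustrative metric: L2 distance between trained weights and
    the initialization, summed over all parameter tensors.
    (Assumed definition; not taken from the paper.)"""
    return np.sqrt(sum(
        np.sum((w_t - w_0) ** 2)
        for w_0, w_t in zip(init_params, trained_params)
    ))

# Toy example with two random "layers": a small perturbation of the
# initialization yields a small learning distance.
rng = np.random.default_rng(0)
init = [rng.standard_normal((4, 4)), rng.standard_normal(4)]
trained = [w + 0.1 * rng.standard_normal(w.shape) for w in init]
print(learning_distance(init, trained))
```

Under this reading, the paper's regularizer would penalize this quantity during training so that the tickets found by iterative pruning stay close to the initialization.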
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=adgulKgRvC
Changes Since Last Submission: With respect to the previous submission, we have made the following significant changes: 1) Improved clarity and narrative of the paper. This includes rewriting large sections, adding structure to break up long passages of text, and removing several components that detracted from the main storyline. More specifically, we have dropped convergence speed and forgetting scores from the considered metrics. We have also introduced the terms early-stable / late-stable hyperparameters for increased clarity. 2) Additional exploration of the impact of mask search budget on various phenomena. Rather than briefly studying AIMP w.r.t. different hyperparameters, we follow up on this in the later sections on few-shot generalizability, frozen features, and transferability. 3) Less focus on pretrained networks. After demonstrating that winning tickets can be found without pretraining, we limit our focus in the rest of the paper to lottery tickets found at initialization. 4) Additional model-dataset combinations. We have included a larger model trained on the ImageNet-100 subset, as well as a Swin transformer trained on TinyImageNet, to accommodate reviewer feedback that the original version was too focused on ResNet models and smaller datasets. 5) Learning-distance hypothesis. We have added a hypothesis based on learning distance as to why certain hyperparameters can find winning tickets at initialization while others cannot. We first show that certain hyperparameters limit learning distance, leading to tickets found closer to the initialization. As these correspond to the hyperparameters that find winning tickets at initialization, we then employ a regularizer to artificially limit the learning distance, and show that this improves the generalization of the late-stable hyperparameters.
To this end, we have also modified the title to better reflect upon this observation, rather than on the instability to SGD noise.
Assigned Action Editor: ~Yani_Ioannou1
Submission Number: 7003