Winning the Lottery Once and For All: Towards Pruning Neural Networks at Initialization

TMLR Paper2354 Authors

08 Mar 2024 (modified: 17 Sept 2024). Rejected by TMLR. License: CC BY 4.0

Abstract: The Lottery Ticket Hypothesis (LTH) posits the existence of winning tickets, i.e., sparse subnetworks within randomly initialized dense neural networks that, when trained from scratch with an optimal learning rate and a similar training budget, achieve test accuracy comparable to that of the original, unpruned network. Despite this promising conjecture, recent studies have cast doubt on the feasibility of identifying such winning tickets at initialization, particularly in large-scale settings. They suggest that in such settings, winning tickets emerge only during the early phase of training. This observation contradicts the core tenet of LTH, as these winning tickets do not truly win the initialization lottery. In light of these findings, we address a critical question: if winning tickets can only be obtained during early iterations, does the initial training phase of a neural network encode vital knowledge, which we refer to as lottery-ticket information, that can be used to generate winning tickets at initialization, especially in large-scale scenarios? We answer this question affirmatively by introducing a novel approach, Knowledge Distillation-based Lottery Ticket Search. Our framework harnesses latent response-, feature-, and relation-based lottery-ticket information from an ensemble of teacher networks, employing a series of deterministic approximations to address an intractable Mixed Integer Optimization problem. This enables us to consistently win the initialization lottery in complex settings, identifying winning tickets right at initialization at sparsity levels as high as 95% for VGG-16 and 65% for ResNet-20, and doing so 19 times faster than Iterative Magnitude Pruning (IMP). Remarkably, without bells and whistles, even winning tickets identified early in training with our technique consistently yield a performance gain of 2% for VGG-16 and 1.5% for ResNet-20 across various sparsity levels, thereby surpassing existing methods.

Submission Length: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Yingbin_Liang1

Submission Number: 2354
