Revisiting the Lottery Ticket Hypothesis for Pre-trained Networks

16 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: lottery ticket hypothesis, transfer learning
TL;DR: Based on the fact that pre-trained networks are significantly more stable than randomly initialized models, we empirically demonstrate that winning tickets can be found efficiently in the context of transfer learning.
Abstract: The lottery ticket hypothesis (LTH) suggests the possibility of pruning neural networks at initialization. Our study revisits LTH in the context of transfer learning, unveiling insights that go beyond prior studies limited to applying LTH to pre-trained networks. To begin, we show that multiple pruning-at-initialization methods are likely to find worse pruning masks than simple magnitude-based pruning for pre-trained networks, owing to an inaccurate approximation of each weight's influence. Iterative magnitude pruning (IMP) can find trainable subnetworks (winning tickets) even for pre-trained networks; however, IMP is a costly algorithm that requires multiple training cycles. Given that trainable subnetworks can be identified only when the initial network withstands the randomness inherent in training, and that pre-trained networks are considerably more resilient to this randomness than randomly initialized networks, we empirically demonstrate that trainable subnetworks can be identified more efficiently in the transfer learning setting. Challenging conventional wisdom about gradual magnitude pruning (GMP), we show that it can significantly improve the trade-off between transfer learning performance and sparsity relative to pruning-at-initialization methods. Our experiments, which cover convolutional neural networks and transformers across both vision and language domains, demonstrate that GMP can identify trainable subnetworks for pre-trained networks at a significantly lower cost than IMP. For example, for an ImageNet pre-trained ResNet-50 at a pruning ratio of 99%, GMP achieves comparable or superior results to IMP on the CIFAR, Caltech-101, Oxford-IIIT Pets, and Stanford Cars datasets, with 42 times less computation. Ultimately, we provide empirical evidence that the methodological distinction between LTH-based and conventional pruning methods can be blurred for pre-trained networks.
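For readers unfamiliar with the pruning schemes contrasted in the abstract, the snippet below is a minimal, hypothetical sketch of gradual magnitude pruning applied during a single fine-tuning run of a pre-trained model. It is not the authors' implementation; the model choice, sparsity schedule, helper names, and synthetic data are illustrative assumptions.

```python
# Minimal sketch of gradual magnitude pruning (GMP) during fine-tuning.
# Hypothetical illustration only -- the schedule, helpers, and data are assumptions,
# not the paper's code.
import torch
import torchvision


def gmp_sparsity(step, total_steps, final_sparsity=0.99):
    """Cubic sparsity ramp from 0 up to the final target sparsity."""
    progress = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - progress) ** 3)


def apply_magnitude_mask(model, sparsity):
    """Zero out the globally smallest-magnitude weights of conv/linear layers."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * scores.numel())
    if k == 0:
        return
    threshold = torch.kthvalue(scores, k).values
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).float())


# Pre-trained network with a new head for an assumed 10-class target task.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

# Synthetic stand-in batches; a real run would use the target-task loader
# (e.g. CIFAR-10) and a much larger step budget.
train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,)))
                for _ in range(4)]
total_steps = len(train_loader)

for step, (images, labels) in enumerate(train_loader):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    # Re-apply the magnitude mask with the scheduled (growing) sparsity.
    apply_magnitude_mask(model, gmp_sparsity(step + 1, total_steps))
```

Unlike IMP, which rewinds and retrains the network for each pruning round, a schedule like this reaches the target sparsity within one fine-tuning pass, which is the source of the cost gap the abstract reports.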
Supplementary Material: zip
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 507