Keywords: pruning, lottery ticket hypothesis, finetuning
Abstract:

**Scope of Reproducibility**

We reproduce *Comparing Rewinding and Fine-tuning in Neural Network Pruning* by Renda et al. In this work the authors compare three different approaches to retraining neural networks after pruning: 1) fine-tuning, 2) rewinding weights, as in Frankle et al., and 3) a new, original method, learning rate rewinding, which builds upon Frankle et al. We reproduce the results of all three approaches, but we focus on verifying the new approach, learning rate rewinding, since it is newly proposed and is described as a universal alternative to the other methods. We used CIFAR-10 for most reproductions, along with additional experiments on the larger CIFAR-100, which extends the results originally provided by the authors. We also extended the list of tested network architectures to include Wide ResNets. The new experiments led us to discover limitations of learning rate rewinding, which can worsen pruning results on large architectures.

**Methodology**

We implemented the code ourselves in Python with TensorFlow 2, basing our implementation on the paper alone, without consulting the source code provided by the authors. We ran two sets of experiments. In the reproduction set, we strove to exactly reproduce the experimental conditions of Renda et al. We also conducted additional experiments using other network architectures, showing results previously unreported by the authors. We did not cover all originally reported experiments; we covered as many as were needed to assess the validity of the claims. We used Google Cloud resources and a local machine with 2x RTX 3080 GPUs.

**Results**

We were able to reproduce the exact results reported by the authors in all originally reported scenarios. However, extended results on larger Wide Residual Networks demonstrated the limitations of the newly proposed learning rate rewinding: we observed a previously unreported accuracy degradation for low sparsity ranges.
Nevertheless, the general conclusion of the paper still holds and was indeed reproduced.

**What was easy**

Re-implementation of the pruning and retraining methods was technically easy, as it is based on a popular and simple pruning criterion: magnitude pruning. The original work was descriptive enough to reproduce the results satisfactorily without consulting the code.

**What was difficult**

Not every design choice was mentioned in the paper; thus, reproducing the exact results was difficult and required a meticulous choice of hyperparameters. Experiments on the ImageNet and WMT16 datasets were time-consuming and required extensive resources, so we did not verify them.

**Communication with original authors**

We did not contact the original authors, as there was no need.
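To make the compared methods concrete, here is a minimal NumPy sketch of the magnitude-pruning criterion and of how the three retraining schemes differ. All names and signatures are hypothetical illustrations, not the authors' code; the rewind point and step counts are assumptions for the sake of the example.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Magnitude pruning: zero out the fraction `sparsity` of weights
    with the smallest absolute value. Returns (pruned weights, mask)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        mask = np.ones_like(weights)
    else:
        # k-th smallest magnitude serves as the pruning threshold
        threshold = np.partition(flat, k - 1)[k - 1]
        mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def retraining_schedule(method, lr_schedule, total_steps, rewind_step):
    """Sketch of how each retraining method treats weights and the LR.

    Returns (weights_source, lr_fn, steps):
      - fine-tuning:      keep pruned final weights, train at the final low LR
      - weight rewinding: reset surviving weights to their values at
                          `rewind_step`, then replay the LR schedule from there
      - LR rewinding:     keep pruned final weights, but replay the LR
                          schedule from `rewind_step`
    """
    steps = total_steps - rewind_step
    if method == "fine-tuning":
        return "final", (lambda t: lr_schedule(total_steps - 1)), steps
    if method == "weight rewinding":
        return "rewound", (lambda t: lr_schedule(rewind_step + t)), steps
    if method == "lr rewinding":
        return "final", (lambda t: lr_schedule(rewind_step + t)), steps
    raise ValueError(f"unknown method: {method}")
```

With a typical step-decay schedule, learning rate rewinding differs from fine-tuning only in replaying the high early learning rates, which is what makes it a drop-in alternative.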
Paper Url: https://openreview.net/forum?id=YrKu2s0HaG6Q&noteId=SxQ0jqm5O0w
Supplementary Material: zip
Community Implementations: [4 code implementations](https://www.catalyzex.com/paper/arxiv:2003.02389/code)