Not All Lotteries Are Made Equal

Introduction

The Lottery Ticket Hypothesis (LTH) [1] states that a reasonably sized neural network contains a subnetwork that, when trained from the same initialization as the original network, performs no worse than its dense counterpart.

In simpler words, inside the dense model there exists at least one sparse model that, when trained from the same initial parameters, attains the same (or even better) performance as the dense model. These well-performing subnetworks are called winning tickets. Winning tickets can have up to 90% of their weights removed with no loss in performance [1], providing computational efficiency while maintaining quality.

Finding these subnetworks is traditionally done by iteratively removing unimportant weights with off-the-shelf pruning methods. However, the importance of a weight is a loosely defined notion, and each pruning method has its own criterion for measuring it.

To the best of our knowledge, prior work on the LTH has only investigated overparameterized models, and the emergence of winning tickets is often attributed to the initial model being large, i.e., to a dense sampling of tickets.

In this blog post, we present evidence that challenges this notion. We investigate the effect of model size on the ease of finding winning tickets and show that winning tickets are, in fact, easier to find for smaller models.

Background

Trajectory Length

Raghu et al. [2] propose a measure of the expressive power of a neural network: a circular trajectory (an arc in 2D) is fed to the model, and the model's output is projected down to two dimensions. The trajectory length is simply the length of the arc after transformation by the model, as observed at the output.

The authors note that the trajectory length increases as the model is trained and grows exponentially with depth.

[Figure: Trajectory length. Image taken from Raghu et al. [2].]
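To make the metric concrete, here is a minimal sketch of how a trajectory length could be computed in PyTorch. The number of sample points, the random embedding of the circle into input space, and the random 2D output projection are our own illustrative choices rather than the exact protocol of Raghu et al. [2].

```python
import math
import torch

def trajectory_length(model, n_points=1000, input_shape=(3, 32, 32), device="cpu"):
    """Approximate the output trajectory length of `model` on a circular input
    trajectory (illustrative sketch, not the exact protocol of Raghu et al. [2])."""
    model.eval()
    input_dim = math.prod(input_shape)

    # Sample points along a circle and embed them in input space
    # using two fixed random directions.
    t = torch.linspace(0, 2 * math.pi, n_points, device=device)
    u = torch.randn(input_dim, device=device)
    v = torch.randn(input_dim, device=device)
    inputs = torch.cos(t)[:, None] * u + torch.sin(t)[:, None] * v
    inputs = inputs.reshape(n_points, *input_shape)   # e.g. CIFAR-10-shaped inputs

    with torch.no_grad():
        outputs = model(inputs)                       # (n_points, num_classes)

    # Project the outputs to 2D and sum the Euclidean distances
    # between consecutive points on the transformed curve.
    proj = torch.randn(outputs.shape[-1], 2, device=device)
    curve = outputs @ proj
    return torch.linalg.norm(curve[1:] - curve[:-1], dim=-1).sum().item()
```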

Winning Tickets

Following Savarese et al. [3], we define two types of winning tickets:

  1. Best Sparse Model: The best performing sparse model regardless of its extent of sparsity.
  2. Sparsest Matching Model: The sparsest model that is at least as performant as the dense model.
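A minimal sketch of how these two tickets could be selected from the results of a ticket search; the `(sparsity, test_acc)` record format and the function name are our own illustrative choices.

```python
def select_winning_tickets(results, dense_acc):
    """results: list of (sparsity, test_acc) pairs for the pruned models,
    where sparsity is the fraction of removed weights in [0, 1].
    Returns (best_sparse, sparsest_matching); the latter is None if no
    pruned model matches the dense accuracy."""
    # Best Sparse Model: highest accuracy, regardless of sparsity.
    best_sparse = max(results, key=lambda r: r[1])

    # Sparsest Matching Model: highest sparsity among models that are
    # at least as accurate as the dense baseline.
    matching = [r for r in results if r[1] >= dense_acc]
    sparsest_matching = max(matching, key=lambda r: r[0]) if matching else None

    return best_sparse, sparsest_matching
```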

Experiments

Setup

Metrics

First, we define the metrics used in the following sections:

  • Success Rate: $SR = \frac{N_{success}}{N_{total}} \cdot 100$

Here, $N_{success}$ is the number of times the ticket search yields at least one winning ticket, and $N_{total}$ is the total number of times the ticket search was run.

  • Accuracy Gain: $A_{gain} = \frac{A_{sparse} - A_{dense}}{A_{dense}} \cdot 100$

Here, $A_{sparse}$ and $A_{dense}$ are the test accuracy attained by the sparse and dense models respectively.

  • Trajectory Length Gain: $TL_{gain} = \frac{TL_{sparse} - TL_{dense}}{TL_{dense}} \cdot 100$

Here, $TL_{sparse}$ and $TL_{dense}$ are the trajectory lengths of the sparse and dense models respectively, after training.
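For reference, these metrics translate directly into code; a minimal sketch (the function and variable names are our own):

```python
def success_rate(n_success, n_total):
    # Percentage of ticket-search runs that found at least one winning ticket.
    return n_success / n_total * 100

def relative_gain(sparse_value, dense_value):
    # Shared form of the Accuracy Gain and Trajectory Length Gain: percentage
    # change of the sparse model's value relative to the dense model's.
    return (sparse_value - dense_value) / dense_value * 100
```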

Model Architectures

We investigate seven different model architectures. In the table below, we list the chosen networks and their respective number of parameters as a measure of model size.

| Name         | # Params |
|--------------|----------|
| LeNet-5      | 61K      |
| PNASNet-A    | 0.13M    |
| PNASNet-B    | 0.45M    |
| ResNet32     | 0.46M    |
| MobileNetV1  | 3.22M    |
| EfficientNet | 3.59M    |
| ResNet18     | 11.17M   |

Prior work [3] has shown that ticket search with some pruning methods fails completely on ResNets but works well for linearly mode-connected models such as VGG. Lee et al. [4] claim that LAMP works well for both VGG and ResNets.

In this work, we use Layer-Adaptive Magnitude Pruning (LAMP) [4] for finding winning tickets. The ticket search algorithm we use is as follows:

  1. Initialize a model M.
  2. Store initial weights.
  3. Train M on the chosen dataset for a number of iterations.
  4. Perform pruning using LAMP.
  5. Rewind the surviving weights to their initial values.
  6. Repeat steps 3-5, k times.

Since we train a large number of models, we set k=5, which is a very conservative value considering that the authors of LAMP used k=20. This also means that the ticket search is not nearly as effective, since a large number of weights are pruned at each stage. However, we use the same setup for every architecture.
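Below is a minimal sketch of this loop in PyTorch, with a simplified LAMP scoring function based on the description in [4]. The per-round pruning fraction, the `train_fn` training callback, and the other helper names are our own illustrative choices, not the authors' exact implementation.

```python
import copy
import torch

def lamp_scores(weight):
    """LAMP score of each weight in a layer: w_i^2 divided by the sum of w_j^2
    over all weights in the same layer whose magnitude is at least |w_i|
    (already-pruned, zeroed weights end up with a score of 0)."""
    w2 = weight.detach().flatten() ** 2
    sorted_w2, order = torch.sort(w2)                      # ascending magnitude
    # Reverse cumulative sum: for each weight, the sum of squares of all
    # weights at least as large as itself.
    tail_sums = torch.flip(torch.cumsum(torch.flip(sorted_w2, [0]), 0), [0])
    scores = torch.empty_like(w2)
    scores[order] = sorted_w2 / tail_sums
    return scores.view_as(weight)

def ticket_search(model, train_fn, prunable_params, k=5, prune_frac=0.5):
    """Iterative ticket search with LAMP pruning and weight rewinding (sketch).
    `prunable_params` maps parameter names to the tensors to be pruned, and
    `train_fn(model, masks)` trains the model while keeping masked weights at 0."""
    init_state = copy.deepcopy(model.state_dict())         # step 2: store init
    masks = {n: torch.ones_like(p) for n, p in prunable_params.items()}

    for _ in range(k):                                     # step 6: repeat k times
        train_fn(model, masks)                             # step 3: train

        # Step 4: globally prune the lowest-scoring fraction of surviving weights.
        surviving_scores = torch.cat([
            lamp_scores(p)[masks[n].bool()] for n, p in prunable_params.items()
        ])
        n_prune = max(1, int(prune_frac * surviving_scores.numel()))
        threshold = torch.kthvalue(surviving_scores, n_prune).values
        for n, p in prunable_params.items():
            masks[n] = masks[n] * (lamp_scores(p) > threshold).float()

        # Step 5: rewind surviving weights to their initial values.
        model.load_state_dict(init_state)
        with torch.no_grad():
            for n, p in prunable_params.items():
                p.mul_(masks[n])

    return model, masks
```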

We perform ticket search 50 times for each architecture on the CIFAR-10 dataset. After each stage of pruning, we record the trajectory length, the percentage of surviving weights, the test accuracy, etc.

Other experimental details can be found in the code, which will be released after the review period. We build extensively on the code base of LAMP [4]; this work would not have been possible without their open-sourced code.

We now present evidence for the claims made in the Introduction.

Results

Ticket Search Difficulty

For the architectures we tested, we observe that winning tickets are easier to find for smaller architectures, with LeNet-5 being the exception.

Quality of Winning Tickets

We make the following observations:

  • ResNet18 is the largest model, both in depth and in number of parameters, yet it has the lowest Success Rate, i.e., winning tickets are the hardest to find for it. This contradicts the notion of "lotteries" in the LTH: a larger model does not necessarily imply an easier ticket search.
  • LeNet-5 has a very low Success Rate; however, when ticket search does find winning tickets, the gain in accuracy is the highest.
  • EfficientNet has one of the highest Success Rates, yet the lowest Accuracy Gain.

This suggests that it is unlikely that model size is the only reason for the emergence of the LTH.

We also note that the smaller architectures usually benefit from ticket search just as much as, or more than, the larger ResNet18 architecture.

The trajectory length of the Best Sparse model increases or stays the same relative to the dense model. Interestingly, for the Sparsest Matching model, the trajectory length decreases significantly, with PNASNet-A being the exception to this trend.

Summarized Statistics

We provide summarized statistics below so that readers can draw their own inferences.

[Figure: Summarized Statistics]

References

[1] Frankle, J. & Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR, 2019.

[2] Raghu, M., Poole, B., Kleinberg, J., Ganguli, S. & Sohl-Dickstein, J. On the Expressive Power of Deep Neural Networks. ICML, 2017.

[3] Savarese, P. H., Silva, H. & Maire, M. Winning the Lottery with Continuous Sparsification. NeurIPS, 2020.

[4] Lee, J., Park, S., Mo, S., Ahn, S. & Shin, J. Layer-adaptive Sparsity for the Magnitude-based Pruning. ICLR, 2021.