Not All Lotteries Are Made Equal
01 Dec 2021 | lottery ticket hypothesis, pruning, sparsity
Introduction
The Lottery Ticket Hypothesis (LTH) [1] states that a reasonably sized neural network contains a subnetwork that, when trained from the same initialization as the full network, performs no worse than its dense counterpart.
In simpler words: inside the dense model there exists at least one sparse model that, trained with the same initial parameters, is capable of attaining the same (or even better) performance as the dense model. These well-performing subnetworks are called winning tickets.
Winning tickets can have up to 90% of their weights removed with no loss in performance [1], providing computational efficiency while maintaining quality.
Finding these subnetworks is traditionally done by using off-the-shelf pruning methods to iteratively remove unimportant weights. However, the importance of a weight is loosely defined, and each pruning method has its own measure of it.
To the best of our knowledge, prior work on the LTH has only investigated overparameterized models, and the emergence of winning tickets is often attributed to the initial model being large, i.e., providing a dense sampling of tickets.
In this blog post, we present evidence that challenges this notion. We investigate the effect of model size on the ease of finding winning tickets, and show that winning tickets are, in fact, easier to find for smaller models.
Background
Trajectory Length
Raghu et al. [2] propose a measure of the expressive power of a neural network. A circular trajectory/arc (in 2D) is fed to a model, and the model's output is projected to two dimensions. The trajectory length is simply the length of the arc after transformation by the model, as observed at the output.
The authors note that as the model is trained, the trajectory length increases and grows exponentially with depth.
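As a concrete illustration, here is a minimal sketch of how this measurement could be computed in PyTorch. It assumes a model that accepts flat feature vectors, and the projection (taking the first two output coordinates) is a simplification of the setup in [2], not the original code:

```python
import math
import torch

def trajectory_length(model, input_dim, n_points=1000, radius=1.0):
    """Length of the model's output curve for a circular input trajectory."""
    model.eval()
    # Sample points along a 2D circle (the input trajectory/arc).
    t = torch.linspace(0, 2 * math.pi, n_points)
    circle = radius * torch.stack([torch.cos(t), torch.sin(t)], dim=1)
    # Embed the circle in the model's input space (zero-padding here;
    # any fixed 2D embedding of the input space would serve).
    if input_dim > 2:
        circle = torch.cat([circle, torch.zeros(n_points, input_dim - 2)], dim=1)
    with torch.no_grad():
        out = model(circle)
    proj = out[:, :2]  # project the output to two dimensions
    # Trajectory length: sum of distances between consecutive output points.
    return (proj[1:] - proj[:-1]).norm(dim=1).sum().item()
```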

Winning Tickets
Following Savarese et al. [3], we define two types of winning tickets:
- Best Sparse Model: the best-performing sparse model, regardless of its extent of sparsity.
- Sparsest Matching Model: the sparsest model that is at least as performant as the dense model.
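Given the per-stage results of a ticket search, both tickets can be picked out mechanically. A small illustrative helper (the function and argument names are ours, not from [3]):

```python
def pick_tickets(results, dense_acc):
    """results: list of (sparsity, test_accuracy) pairs, where sparsity
    is the fraction of weights removed."""
    # Best Sparse Model: highest accuracy, regardless of sparsity.
    best_sparse = max(results, key=lambda r: r[1])
    # Sparsest Matching Model: the sparsest model at least as accurate
    # as the dense baseline (None if no stage matches).
    matching = [r for r in results if r[1] >= dense_acc]
    sparsest_matching = max(matching, key=lambda r: r[0]) if matching else None
    return best_sparse, sparsest_matching
```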
Experiments
Setup
Metrics
First, we define the metrics used in the following sections:
Success Rate
$ SR = \frac{N_{success}}{N_{total}} \cdot 100 $
Here, $N_{success}$ is the number of times the ticket search yields at least one winning ticket; and $N_{total}$ is the total number of times the ticket search was run.
Accuracy Gain
$A_{gain} = \frac{(A_{sparse} - A_{dense})}{A_{dense}} \cdot 100$
Here, $A_{sparse}$ and $A_{dense}$ are the test accuracy attained by the sparse and dense models respectively.
Trajectory Length Gain
$TL_{gain} = \frac{(TL_{sparse} - TL_{dense})}{TL_{dense}} \cdot 100 $
Here, $TL_{sparse}$ and $TL_{dense}$ are the trajectory lengths of the sparse and dense models respectively, after training.
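All three metrics are simple relative percentages; for concreteness, here they are as functions (variable names are ours):

```python
def success_rate(n_success, n_total):
    # Percentage of ticket searches yielding at least one winning ticket.
    return n_success / n_total * 100

def accuracy_gain(a_sparse, a_dense):
    # Relative test-accuracy change of the sparse model, in percent.
    return (a_sparse - a_dense) / a_dense * 100

def trajectory_length_gain(tl_sparse, tl_dense):
    # Relative trajectory-length change after training, in percent.
    return (tl_sparse - tl_dense) / tl_dense * 100
```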
Model Architectures
We investigate seven model architectures. The table below lists the chosen networks and their parameter counts, which we use to represent model size.
| Name | # Params |
|---|---|
| LeNet-5 | 61K |
| PNASNet-A | 0.13M |
| PNASNet-B | 0.45M |
| ResNet32 | 0.46M |
| MobileNetV1 | 3.22M |
| EfficientNet | 3.59M |
| ResNet18 | 11.17M |
Ticket Search
Prior work [3] has shown that ticket search with some pruning methods fails completely on ResNets, but works well for linearly mode-connected models such as VGG. Lee et al. [4] claim that LAMP works well for both VGG and ResNets.
In this work, we use Layer-Adaptive Magnitude Pruning (LAMP) [4] to find winning tickets. The ticket search algorithm we use is as follows (a code sketch follows the list):
1. Initialize a model M.
2. Store the initial weights.
3. Train M on the chosen dataset for a number of iterations.
4. Prune using LAMP.
5. Rewind the surviving weights to their stored initial values.
6. Repeat Steps 3-5, k times.
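Below is a minimal PyTorch sketch of this loop, using the LAMP score from [4] (a weight's squared magnitude, normalized by the sum of squared magnitudes of surviving weights in the same layer that are no smaller). The training function, pruning fraction, and helper names are illustrative simplifications, not the released LAMP code:

```python
import copy
import torch

def lamp_scores(weight, mask):
    # LAMP score: w^2 / (sum of w'^2 over surviving weights in the
    # same layer with |w'| >= |w|); pruned weights score 0.
    w2 = (weight * mask).detach().flatten() ** 2
    sorted_w2, order = torch.sort(w2)  # ascending by magnitude
    # Suffix sums: for each weight, the total squared magnitude of
    # all weights at least as large as it.
    denom = torch.flip(torch.cumsum(torch.flip(sorted_w2, [0]), 0), [0])
    scores = torch.zeros_like(w2)
    scores[order] = sorted_w2 / denom.clamp(min=1e-12)
    return scores.view_as(weight)

def ticket_search(model, train_fn, k=5, prune_frac=0.2):
    init_state = copy.deepcopy(model.state_dict())        # Step 2: store init
    params = [m.weight for m in model.modules()
              if isinstance(m, (torch.nn.Linear, torch.nn.Conv2d))]
    masks = [torch.ones_like(p) for p in params]
    for _ in range(k):                                    # Step 6: repeat
        # Step 3: train_fn is assumed to keep pruned weights at zero,
        # e.g. by re-applying the masks after every optimizer step.
        train_fn(model, masks)
        # Step 4: prune the prune_frac of surviving weights with the
        # globally lowest LAMP scores.
        scores = [lamp_scores(p, m) for p, m in zip(params, masks)]
        flat = torch.cat([s[m.bool()] for s, m in zip(scores, masks)])
        n_prune = max(int(prune_frac * flat.numel()), 1)
        threshold = torch.kthvalue(flat, n_prune).values
        masks = [m * (s > threshold).float() for s, m in zip(scores, masks)]
        # Step 5: rewind surviving weights to their initial values.
        model.load_state_dict(init_state)
        with torch.no_grad():
            for p, m in zip(params, masks):
                p.mul_(m)
    return model, masks
```

Note that for a fixed target sparsity, a smaller k forces a larger effective pruning fraction per round, which is the trade-off discussed next.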
Because we train a large number of models, we set k = 5, a very conservative value considering that the authors of LAMP used k = 20. This also means the ticket search is not nearly as effective, since a large fraction of weights is pruned at each stage. However, we use the same setup for each architecture.
We perform ticket search 50 times for each architecture on the CIFAR-10 dataset, computing trajectory lengths, the percentage of surviving weights, test accuracy, etc. after each stage of pruning.
Other experimental details can be found in the code, which will be released after the review period. We build extensively on the codebase of LAMP [4]; this work wouldn't have been possible without their open-sourced code.
We now present evidence for the claims made in Section 1.
Results
Ticket Search Difficulty
For the architectures we tested, we observe that it is easier to find winning tickets for smaller architectures, with LeNet-5 being the exception.
Quality of Winning Tickets
We make the following observations:
- ResNet18 is the largest model, both in depth and in number of parameters, yet it has the lowest Success Rate, i.e., winning tickets are hardest to find. This contradicts the notion of "lotteries" in the LTH: a larger model does not necessarily imply an easier ticket search.
- LeNet-5 has a very low Success Rate; however, when ticket search does yield winning tickets, the Accuracy Gain is the highest.
- EfficientNet has one of the highest Success Rates, yet the lowest Accuracy Gain.
This suggests that it is unlikely that model size is the only reason for the emergence of the LTH.
We also note that small architectures usually benefit from ticket search as much as, or more than, the larger ResNet18 architecture.