% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% For algorithm
% \usepackage{algorithmic}
\usepackage{algorithm}
\usepackage[compatible]{algpseudocode} % or \usepackage{algcompatible}
\renewcommand{\algorithmiccomment}[1]{\hfill$\triangleright$\textit{\mdseries{#1}}}

% For theorems and such
\usepackage{amsmath}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\usepackage{bm}
\usepackage{amssymb}

\usepackage{amsthm}
\usepackage{color}
% For fig
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{arydshln} 
\usepackage{xr}
\externaldocument{yin_307-supp}

\usepackage{multirow}


\makeatletter
\def\adl@drawiv#1#2#3{%
        \hskip.5\tabcolsep
        \xleaders#3{#2.5\@tempdimb #1{1}#2.5\@tempdimb}%
                #2\z@ plus1fil minus1fil\relax
        \hskip.5\tabcolsep}
\newcommand{\cdashlinelr}[1]{%
  \noalign{\vskip\aboverulesep
           \global\let\@dashdrawstore\adl@draw
           \global\let\adl@draw\adl@drawiv}
  \cdashline{#1}
  \noalign{\global\let\adl@draw\@dashdrawstore
           \vskip\belowrulesep}}
\makeatother

% \usepackage{multirow}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Superposing Many Tickets into One: \\ A Performance Booster for Sparse Neural Network Training}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Lu Yin}
\author[1]{Vlado Menkovski}
\author[1]{Meng Fang}
\author[1]{Tianjin Huang}
\author[1]{Yulong Pei}
\author[1]{Mykola Pechenizkiy} 
\author[2,1]{\\Decebal Constantin Mocanu}
\author[1]{Shiwei Liu}
% Add affiliations after the authors
\affil[1]{%
    Eindhoven University of Technology \\
    Eindhoven, the Netherlands
}
\affil[2]{%
    University of Twente\\
    Enschede, the Netherlands
}
% \author[1]{\href{mailto:<l.yin@tue.nl>?Subject=Your UAI 2022 paper}{Lu Yin}{}}
% \author[1]{\href{mailto:<l.yin@tue.nl>?Subject=Your UAI 2022 paper}{Vlado Menkovski}{}}
% \author[1]{\href{mailto:<l.yin@tue.nl>?Subject=Your UAI 2022 paper}{Meng Fang}{}}
% \author[1]{\href{mailto:<l.yin@tue.nl>?Subject=Your UAI 2022 paper}{Tianjin Huang}{}}
% \author[1]{\href{mailto:<l.yin@tue.nl>?Subject=Your UAI 2022 paper}{Yulong Pei}{}}
% \author[1]{\href{mailto:<l.yin@tue.nl>?Subject=Your UAI 2022 paper}{Mykola Pechenizkiy}{}}\\
% \author[1,2]{\href{mailto:<l.yin@tue.nl>?Subject=Your UAI 2022 paper}{Decebal Constantin Mocanu}{}}
% \author[1]{\href{mailto:<l.yin@tue.nl>?Subject=Your UAI 2022 paper}{Shiwei Liu}{}}
% Add affiliations after the authors

  \begin{document}
\maketitle

\begin{abstract}

% In concert with the increasingly impressive performance achieved by deep neural networks, the size of the cutting-edge models is also exploding. Larger models usually translate to greater resource demands and, by extension, greater energy costs, and pollution. 
Recent works on sparse neural network training have shown that a compelling trade-off between performance and efficiency can be achieved. Existing sparse training methods usually strive to find the best sparse subnetwork possible in one single run, without involving any expensive dense or pre-training steps. For instance, dynamic sparse training (DST), as one of the most prominent directions,  is capable of reaching a competitive performance of dense training by iteratively evolving the sparse topology during the course of training. In this paper, we argue that it is better to allocate the limited resources to create multiple low-loss sparse subnetworks and superpose them into a stronger one, instead of allocating all resources entirely to find an individual subnetwork. To achieve this, two desiderata are required: (1) efficiently producing many low-loss subnetworks, the so-called cheap tickets, within one training process limited to the standard training time used in dense training; (2) effectively superposing these cheap tickets into one stronger subnetwork without going over the constrained parameter budget. To corroborate our conjecture, we present a novel sparse training approach, termed \textbf{Sup-tickets}, which can satisfy the above two desiderata concurrently in a single sparse-to-sparse training process. Across various models on CIFAR-10/100 and ImageNet, we show that Sup-tickets integrates seamlessly with the existing sparse training methods and demonstrates consistent performance improvement. 

\end{abstract}


\begin{figure}[htbp]
\vskip 0.2in
\begin{center}
\centerline{\includegraphics[width=0.40\textwidth]{images/sup_tickets.pdf}}

\caption{Schematic view of Sup-tickets. Multiple subnetworks (cheap tickets) are efficiently produced within the last 10\% of the training time and are superposed into one single subnetwork with boosted performance while maintaining the target sparsity. We refer to the final subnetwork used for inference as the ``ultimate ticket''.}
\label{fig:Sup-tickets}
\end{center}

\end{figure}


\section{Introduction}



Over the past few years, large-scale deep learning models with billions, even trillions, of parameters have improved the state of the art in nearly every downstream task~\citep{shoeybi2019megatron,radford2021learning,fedus2021switch}. The compelling results achieved by these large-scale models motivate researchers to pursue increasingly gigantic models with little regard for the limited resources of our planet. Fortunately, many techniques for neural network acceleration have already been proposed, which can effectively trim down the memory requirements and computational costs while retaining high accuracy~\citep{mozer1989using,han2015deep,gale2019state}. 

Among them, sparse neural network training~\citep{mocanu2018scalable,evci2020rigging,bellec2018deep} stands out and has recently received growing attention due to its high efficiency in both the training and inference phases. Instead of inheriting well-performing sparse networks from a trained dense network, sparse training approaches typically start from a randomly initialized sparse network and only require training a subset of the corresponding dense network. Since this sparse-to-sparse training process does not involve any dense or pre-training steps, the memory requirements and the floating-point operations (FLOPs) are only a fraction of those of traditional dense training. Nonetheless, naively training a sparse neural network from scratch generally leads to poorer solutions than training a dense network~\citep{evci2019difficulty}. Dynamic sparse training (DST)~\citep{mocanu2018scalable} significantly improves the trainability of sparse networks by dynamically exploring new connectivities during training while maintaining a fixed parameter count. Compared with methods that train with a fixed sparse connectivity~\citep{Mocanu2016xbm,lee2018snip}, DST substantially improves the expressibility of sparse networks, and thus leads to better generalization performance~\citep{liu2021we}. {However, the accuracy of extremely sparse subnetworks (e.g., at sparsity\footnote{The term sparsity refers to the proportion of the neural network's weights that are zero-valued.} 95\% or 90\%) usually remains below that of full dense training under a regular number of training epochs~\citep{evci2020rigging,liu2021sparse}. Enabling sparse training at extreme sparsities to match or even surpass the performance of dense training under a typical number of training epochs will significantly benefit sparse training in practice.} 

% Existing sparse training methods usually strive to find the best sparse subnetwork (\textit{winning ticket}) possible.

A growing body of evidence on sparse training~\citep{liu2021deep} and dense training~\citep{garipov2018loss,draxler2018essentially,fort2019large} reveals that many independent local optima exist in different low-loss basins of the loss landscape. Inspired by these observations, we go one step further and pursue an approach that boosts the performance of sparse training by leveraging these widely existing low-loss basins. Specifically, we propose Superposing Tickets, or \textbf{Sup-tickets} for short, which produces many subnetworks (cheap tickets) in one single run and then superposes all of them into one at the same sparsity. Doing so allows us to leverage the knowledge from various well-performing cheap tickets while still maintaining the training and inference efficiency of sparse training. Overall, we summarize our contributions below:


% Different from prior work on weight averaging~\citep{wortsman2022model,izmailov2018averaging} that  only studied on dense network, we are the first to explore how to produce and combine multiple \textit{sparse sub-networks} into a stronger one. In our method we increase diversity by producing subnetworks with different connectivities  and  employ weight averaging while considering the importance
% of the connectivities. Please note that this is also non-trivial and was not studied in the prior works of weight averaging, including SWA~\citep{izmailov2018averaging}. Since each sub-network has its unique sparse connectivities, averaging them leads to a new sub-network with more connections (we prune this denser network in further steps). Therefore, the final averaged subnetwork is a new one whose sparse connectivity is expected to be different from all the previous sub-networks. Thus, even though we use the averaging technique in our method, the resulting subnetwork is effectively a new (superposed) network.





% Combining the predictions of these subnetworks during test, ``FreeTickets'' outperforms the generalization performance of the dense solutions. Nevertheless, ``FreeTickets'' still requires multiple forward passes for prediction, leading to additional inference overhead compared with DST. Moreover, the lightweight ``FreeTickets'' variant also requires to extend the training time significantly. Thus, the benefits of ``Freetickets'' can not directly generalize to sparse neural network training. 

\begin{itemize}
    \item We propose Sup-tickets, a novel sparse training approach that produces and superposes many cheap yet well-performing subnetworks (cheap tickets) during one sparse-to-sparse training run. The ultimate superposed subnetwork achieves stronger results in predictive accuracy and uncertainty estimation while maintaining the target sparsity. 
    
    % Compared with previous related works, our method neither needs to extend the training time nor to perform the costly ensembling at test.
    
    \item  {
Sup-tickets is a general and versatile performance booster for sparse training, which seamlessly integrates with other state-of-the-art sparse training methods. We conduct extensive experiments to evaluate our method. Across various popular architectures on CIFAR-10/100 and ImageNet, Sup-tickets improves the performance of various sparse training methods without extending the training time.}
    
    \item  {More impressively, in conjunction with an advanced sparse training method, GraNet~\citep{liu2021sparse}, Sup-tickets boosts the performance of sparse training beyond that of dense training on CIFAR-10/100 at extreme sparsity levels of around 90\% $\sim$ 95\%, highlighting the great potential of sparse training in practice.}
    
    
\end{itemize}






\section{Related Work}

\subsection{Sparse Neural Network Training}
Sparse neural network training is a thriving topic. It aims to train sparse neural networks from scratch and reach performance competitive with their dense counterparts while using only a fraction of the latter's resources. Depending on whether the sparse connectivity changes during training, sparse training can usually be divided into static sparse training (SST) and dynamic sparse training (DST).

\textbf{Static sparse training} represents a class of methods that train initial sparse neural networks with a fixed sparse connectivity pattern throughout training. While the sparse connectivity is static, the choices of the particular layer-wise sparsity (i.e., sparsity level of every single layer) can be diverse. The most naive approach is sparsifying each layer uniformly, i.e., uniform sparsity~\citep{gale2019state}.~\citet{Mocanu2016xbm} proposed a non-uniform sparsity method that can be applied in Restricted Boltzmann Machines (RBMs) and achieves better performance than dense RBMs. Some works explore 
the expander graph to train sparse CNNs and show comparable performance against the corresponding dense CNNs~\citep{kepner2019radix}. Inspired by graph theory, \textit{Erd{\H{o}}s-R{\'e}nyi} (ER)~\citep{mocanu2018scalable} and its CNN variant \textit{Erd{\H{o}}s-R{\'e}nyi-Kernel} (ERK)~\citep{evci2020rigging} allocate lower sparsity to smaller layers, avoiding the layer collapse problem~\citep{tanaka2020pruning} and generally achieving stronger results than uniform sparsity.

\textbf{Dynamic sparse training} trains initially sparse neural networks while dynamically adjusting the sparse connectivity pattern during training. DST was first introduced in Sparse Evolutionary Training (SET)~\citep{mocanu2018scalable}, which initializes the sparse connectivity with an ER topology and periodically explores the parameter space via a prune-and-grow scheme during training. Following SET, weight redistribution was introduced to search for better layer-wise sparsity ratios during training~\citep{mostafa2019parameter,dettmers2019sparse}. The most commonly used pruning criterion in existing DST methods is magnitude pruning, whereas the criterion used for weight regrowth varies from method to method. Gradient-based regrowth, e.g., momentum~\citep{dettmers2019sparse} and gradient~\citep{evci2020rigging}, shows strong results in image classification, whereas random regrowth outperforms it in language modeling~\citep{dietrich2021towards}. Follow-up works improve the accuracy by relaxing the constrained memory footprint~\citep{yuan2021mest,liu2021sparse}. Very recently,~\citet{liu2021deep} proposed FreeTickets, an efficient ensemble framework for sparse training. By directly ensembling the predictions of individual subnetworks, FreeTickets surpasses the generalization performance of the naive dense ensemble. Nevertheless, FreeTickets requires extending the training time to obtain multiple cheap subnetworks and performing multiple forward passes for inference, contrary to our pursuit of efficient training.


\subsection{Weight Averaging}

{Computing a convex combination of model weights usually leads to better and more robust performance~\citep{zhang2019lookahead,neyshabur2020being,wortsman2022model}. SWA~\citep{izmailov2018averaging} averages weights along a single optimization trajectory within one run. \citet{neyshabur2020being}, in contrast, merge models that start from the same initialization but are optimized independently. Similarly, \citet{wortsman2022model} average models across many independent runs with various hyperparameters. Different from these prior works, which only study dense networks, we explore for the first time how to produce and combine multiple \textit{sparse sub-networks} into a stronger one while considering the importance of the connectivities.}

% Recently, an alternative approach to average model weights using Fisher information is proposed by ~\cite{matena2021merging}

% Besides, we increase diversity by producing subnetworks with different connectivities  and  employ weight averaging while considering the importance of the connectivities. Please note that this is also non-trivial and was not studied in the prior works of weight averaging.


% Since each sub-network has its unique sparse connectivities, averaging them leads to a new sub-network with more connections (we prune this denser network in further steps). Therefore, the final averaged subnetwork is a new one whose sparse connectivity is expected to be different from all the previous sub-networks. Thus, even though we use the averaging technique in our method, the resulting subnetwork is effectively a new (superposed) network.

% \paragraph{Dynamic Sparse Training.} Dynamic sparse training (DST) starts from Sparse Evolutionary Training (SET)~\citep{mocanu2018scalable,liu2021sparse} which initializes the sparse connectivity with \textit{Erd{\H{o}}s-R{\'e}nyi}~\citep{24gotErdos1959} topology and periodically explores the sparse connectivity via a prune-and-grow scheme during the course of training. DST has also demonstrated its versatility in broad fields such as feature detection~\citep{atashgahi2020quick}, lifelong learning~\citep{sokar2021spacenet}, federated learning~\citep{zhu2019multi,huang2022fedspa}, and adversarial
% training~\citep{ozdenizci2021training}.



\section{Methodology}



\begin{algorithm*}[tb]
   \caption{Sup-tickets}
   \label{alg:example}
\begin{algorithmic}[1]
\REQUIRE Network $f(\bm{x}; \bm{\mathrm{\theta}})$,  superposed subnetwork $\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}$, target sparsity $S$, training time $T$, cycle length $C$, learning rate $\alpha$, pruning criterion $\Psi$,  growing criterion $\Phi$,  pruning rate for parameter exploration $p$.


\STATE $f(\bm{x}; \bm{\mathrm{\theta}}_s)$ $\gets$  $f(\bm{x}; \bm{\mathrm{\theta}}; S)$  \Comment{Sparsely initialize the network}
\FOR {$i \leftarrow 1$ \bf{to} $T$}  
\IF{$i \leq 90\% T$}   \Comment{Normal sparse training for the first 90\% of T}
\STATE $f(\bm{x}; \bm{\mathrm{\theta}}_s) \gets$ $SparseTraining(f(\bm{x}; \bm{\mathrm{\theta}}_s))$ 
\ELSE                  \Comment{Creating and superposing cheap tickets in the last 10\% of T}
\STATE $\alpha \leftarrow \alpha(i) $                                                                  \Comment{Calculate the cyclical learning rate using Eq.~\ref{eq:cyc_schedule}}
\STATE $f(\bm{x}; \bm{\mathrm{\theta}}_s) \gets$ $SparseTraining(f(\bm{x}; \bm{\mathrm{\theta}}_s); \alpha)$ 
%\Comment{Sparse training with re-scheduled learning rate  (10\% T)}
\IF{$\bmod(i-90\%T, C) = 0$}     
    \STATE  $t \gets (i-90\%T)/C$              \Comment{Number of the created cheap tickets}
     
    \STATE 
$\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t} \gets \frac{(t-1) \cdot \widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t-1} + \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{t}}{t}$  \Comment{  Ticket superposing using Eq.~\ref{eq:average} }
    \STATE 
    $ \widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t} \gets MagnitudePruning(\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t})$ 
    \Comment{Prune the superposed ticket to the target sparsity $S$}
    \STATE 
    $\bm{\mathrm{\theta}}_{\mathrm{s}}^\prime  \gets \Psi(\bm{\mathrm{\theta}}_\mathrm{s},~p)$  \Comment{Parameter exploration using Eq.~\ref{eq:prune} and Eq.~\ref{eq:regrow}}  
    \STATE  $\bm{\mathrm{\theta}}_\mathrm{s} \gets  \bm{\mathrm{\theta}}_{\mathrm{s}}^\prime \cup     \Phi(\bm{\mathrm{\theta}}_{i \notin \bm{\mathrm{\theta}}_\mathrm{{s}^\prime}},~p)$   
    % \Comment{ Re-grow a fraction $p$ of parameters using Eq.~\ref{eq:regrow}}  
\ENDIF
\ENDIF
\ENDFOR 
\STATE Return $\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}$  \COMMENT{The ultimate ticket for test}

  
%  \STATE \textcolor{purple}{\# Create cheap tickets for superposing  ($10\% B$ )}
% \FOR {$i \leftarrow 0.9n, \ldots, n$} 

% \STATE $\alpha \leftarrow \alpha(i)$ using Eq.~\ref{eq:cyc_schedule} \COMMENT{Calculate LR for the iteration}
% \STATE { $f(x; \bm{\mathrm{\theta}}_s) \gets$ SGD$(f(\bm{x}\bm{\mathrm{\theta}}_s))$}\COMMENT{Gradients update}

%   \IF{$\bmod(i, \Delta \mathrm{T}) = 0$} 
%      \STATE{Connection topology search using Eq.~\ref{eq:prune} \&~\ref{eq:regrow}} 
%     \ENDIF
    
%     \IF{$\bmod(i, c) = 0$} 
%         \STATE Creating cheap ticket
%         \STATE Superposing cheap tickets using CIA
%   \ENDIF
% \ENDFOR\\

\end{algorithmic}
\end{algorithm*}


In this section, we introduce a new approach for sparse training that combines the benefits of multiple cheap tickets without requiring extra training time or multiple forward passes for inference~\citep{garipov2018loss,liu2021deep}. We first introduce the basic training scheme of sparse training in Section~\ref{sec:sparse_training} and then describe our proposed Sup-tickets approach in detail in Section~\ref{sec:sup_ticket}.

\subsection{Prior Sparse Training Art}

\label{sec:sparse_training}

Following~\citet{liu2021we,liu2021deep}, we denote a sparse neural network as $f(\bm{x}; \bm{\mathrm{\theta}}_\mathrm{s})$. Here, $\bm{\mathrm{\theta}}_\mathrm{s}$ refers to a subset of the full network parameters $\bm{\mathrm{\theta}}$ at a sparsity level of $(1 - \frac{\|\bm{\mathrm{\theta}}_\mathrm{s}\|_0}{\|\bm{\mathrm{\theta}}\|_0})$, where $\|\cdot\|_0$ is the $\ell_0$-norm. Sparse training typically initializes the network randomly, with sparse and random connections between adjacent layers, based on a pre-defined uniform or non-uniform layer-wise sparsity ratio\footnote{See~\citet{liu2022the} for the most common types of sparse initialization.}. In the i.i.d. classification setting with data $\{(x_i, y_i) \}_{i=1}^{\mathrm{N}}$, the goal of sparse training is to solve the following optimization problem: $\hat{\bm{\mathrm{\theta}}}_\mathrm{s} = \argmin_{\bm{\mathrm{\theta}}_\mathrm{s}} \sum_{i=1}^{\mathrm{N}} \mathcal{L}(f(x_i;\bm{\mathrm{\theta}}_\mathrm{s}),y_i)$, where $\mathcal{L}$ is the loss function. SST keeps the sparse connectivity of the sparse network fixed after initialization. DST, on the other hand, dynamically adjusts the sparse connectivity via parameter exploration during training while sticking to a fixed sparsity level. The most widely used method for parameter exploration is the prune-and-grow scheme, i.e., pruning the $p\%$ least important parameters from the current subnetwork and then regrowing the same fraction $p\%$ of weights. Formally, the parameter exploration can be written as the following two steps:
\begin{equation}
    \bm{\mathrm{\theta}}_{\mathrm{s}}^\prime = \Psi(\bm{\mathrm{\theta}}_\mathrm{s},~p),
    \label{eq:prune}
\end{equation}
\begin{equation}
    \bm{\mathrm{\theta}}_\mathrm{s} = \bm{\mathrm{\theta}}_{\mathrm{s}}^\prime
    \cup \Phi(\bm{\mathrm{\theta}}_{i \notin \bm{\mathrm{\theta}}_\mathrm{{s}}^\prime},~p),
    \label{eq:regrow}
\end{equation}
where $\Psi$ and $\Phi$ are the specific pruning and growing criteria, respectively. The choices of $\Psi$ and $\Phi$ differ from one sparse training method to another. Apart from the sparse structure, most of the sparse training literature~\citep{dettmers2019sparse,evci2020rigging,mostafa2019parameter,liu2021sparse} considers it a safe choice to keep the other training configurations, such as optimizers, hyperparameters, and learning rate schedules, the same as in normal dense training. At the end of training, sparse training can converge to a well-performing sparse subnetwork whose memory requirements and training and inference FLOPs are only a fraction of those of dense training.
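
To make the two steps concrete, consider a toy setting in which all numbers are illustrative assumptions rather than values from our experiments. Suppose the dense network has $\|\bm{\mathrm{\theta}}\|_0 = 10{,}000$ parameters and the subnetwork keeps $\|\bm{\mathrm{\theta}}_\mathrm{s}\|_0 = 1{,}000$ of them, i.e., the sparsity is $1 - 1{,}000/10{,}000 = 0.9$. With $p = 20\%$, Eq.~\ref{eq:prune} removes $200$ active weights and Eq.~\ref{eq:regrow} activates $200$ of the $9{,}200$ positions that are inactive after pruning:
\begin{equation*}
    \|\bm{\mathrm{\theta}}_\mathrm{s}\|_0:\quad 1{,}000 \;\xrightarrow{\;\Psi\;}\; 800 \;\xrightarrow{\;\Phi\;}\; 1{,}000 .
\end{equation*}
The connectivity therefore changes while the sparsity stays at $0.9$.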

% \moe{maybe move it to next subsection. However xx} %However, almost all the existing sparse training methods allocate all the limited resources to find the best sparse neural network possible. Even though low-loss and diverse subnetworks widely exist in the loss landscape of sparse neural network optimization, no prior works have ever explored how to find and leverage these handy subnetworks to boost the performance of sparse training under regular training epoch number. In the following section, we present Sub-ticket to close this research gap. 

\subsection{Sup-tickets}

%Superposing Tickets -- Sup-tickets}
\label{sec:sup_ticket}
Existing sparse training methods allocate all the limited resources to find the best sparse neural network possible. While low-loss subnetworks widely exist in the loss landscape of sparse neural network optimization~\citep{liu2020topological}, no prior work has explored how to find and leverage these handy cheap tickets to boost the performance of sparse training without extending the number of training steps. In this section, we present Sup-tickets to close this research gap, as illustrated in Figure~\ref{fig:Sup-tickets}.

To achieve the above-mentioned ultimate goal, we need to satisfy the following two desiderata in one sparse-to-sparse training run: 
\begin{enumerate}
    \item \textbf{Creating cheap tickets}: Creating multiple cheap but well-performing subnetworks within one single run under a regular training time. We call such efficiently produced subnetworks ``cheap tickets''.
    \item \textbf{Superposing tickets}: Superposing these subnetworks into one subnetwork at the same sparsity to avoid performing multiple forward passes for prediction. We refer to the final subnetwork used for inference as the ``ultimate ticket''.
\end{enumerate}
These two desiderata strictly follow the sparsity constraint of sparse training and thus maintain the training/inference efficiency of sparse training. 




\subsubsection{Creating Cheap Tickets}
%\textbf{Task 1:} 
% The first 90\% training procedure of Sup-tickets is the same as the existing sparse training approaches. Our unique features locate in 
During the last 10\% of the training time,  we cyclically explore the current sparse connectivity  and restart the learning rate to visit multiple low-loss sub-space basins.
%The pruning and growing criterion align with various DST methods that we aim the boost. 
More concretely, in each cycle, we first significantly change the connectivity of the current subnetwork by performing the parameter exploration once with Eq.~\ref{eq:prune} $\&$~\ref{eq:regrow}. For simplicity, we inherit the pruning and growing methods of the sparse training method with which Sup-tickets is combined. After parameter exploration, we leverage the cyclical learning rate to force the current subnetwork to escape its local minimum. Inspired by~\citet{garipov2018loss,izmailov2018averaging}, we adopt the following learning rate schedule:


\begin{equation}
\label{eq:cyc_schedule}
\resizebox{.90\hsize}{!}{$
        \alpha(i)\!= \left\{ 
        \begin{array}{ll}
        (1 - 2 t(i)) \alpha_1 + 2 t(i) \alpha_2 &  0 < t(i) \le \frac{1}{2} \\
        (2 - 2 t(i)) \alpha_2 + (2 t(i) - 1) \alpha_1 & \frac{1}{2} < t(i) \le 1 
        \end{array} 
    \right.$}  
\end{equation}

where $\alpha(i)$ is the cyclical learning rate ranging between $\alpha_1$ and $\alpha_2$; $i$ is the training iteration (one mini-batch of data); $t(i) = \frac{1}{C}(\bmod(i - 1, C) + 1)$; and $C$ is the cycle length. We modify the cyclical learning rate schedule used in SWA~\citep{izmailov2018averaging} to prevent an aggressive rise of the learning rate. Specifically, we adopt the triangle-like schedule shown in Figure~\ref{fig:learning rate reschedual}-bottom. In this way, the learning rate can transition seamlessly from the normal training stage to the superposing stage. At the end of each cycle, we obtain one cheap ticket from the current basin with a diverse and meaningful representation.
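
For intuition, the schedule can be evaluated by hand for a hypothetical cycle length of $C = 4$ iterations (far shorter than any realistic cycle and chosen only for readability). For $i = 1, \dots, 4$ we have $t(i) = \frac{1}{4}, \frac{1}{2}, \frac{3}{4}, 1$, and Eq.~\ref{eq:cyc_schedule} gives
\begin{equation*}
\begin{aligned}
    \alpha(1) &= \tfrac{1}{2}\alpha_1 + \tfrac{1}{2}\alpha_2, & \alpha(2) &= \alpha_2, \\
    \alpha(3) &= \tfrac{1}{2}\alpha_2 + \tfrac{1}{2}\alpha_1, & \alpha(4) &= \alpha_1 .
\end{aligned}
\end{equation*}
The learning rate thus moves linearly from $\alpha_1$ towards $\alpha_2$ in the first half of the cycle and returns to $\alpha_1$ at the end of the cycle; the pattern repeats from $i = 5$, since $t(5) = \frac{1}{4}$.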

\begin{figure}[h]
\centering
    \subfigure{}{
        \includegraphics[width=0.45\textwidth]{images/lr_swa.pdf}
    }
    \subfigure{}{
        \includegraphics[width=0.45\textwidth]{images/lr_sup.pdf}
    }

\caption{\textbf{Top:} cyclical learning rate schedule of~\citet{garipov2018loss}. \textbf{Bottom:} cyclical learning rate schedule of Sup-tickets. Cheap tickets are collected at the end of each learning rate schedule cycle (green circles in the figure).}
\label{fig:learning rate reschedual}

\end{figure}


The combination of a cyclical learning rate schedule and parameter exploration is also used in FreeTickets~\citep{liu2021deep}, but we make several changes so that it complies with the requirements of sparse training. The cycle duration of FreeTickets is set to 100 epochs to guarantee consistently strong performance of each subnetwork, since FreeTickets aims to match the performance of the dense ensemble. However, such a long cycle conflicts with the goal of sparse training. We therefore reduce the cycle duration to 2 epochs for ImageNet and 8 epochs for CIFAR-10/100, and only use the final 10\% of the training time to generate cheap tickets. In this case, the overall training time is the same as that of training a single sparse network.
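
As a rough illustration of the resulting number of cheap tickets, assume a hypothetical budget of $T = 100$ epochs on ImageNet (a value chosen only for this example). The final $10\%$ of training then spans $10$ epochs, and a cycle of $C = 2$ epochs yields
\begin{equation*}
    \mathrm{M} = \left\lfloor \frac{0.1\,T}{C} \right\rfloor = \left\lfloor \frac{10}{2} \right\rfloor = 5
\end{equation*}
cheap tickets, with $T$ and $C$ measured in the same unit. Under the same kind of assumption, a budget of $T = 160$ epochs with an $8$-epoch cycle would yield $\lfloor 16/8 \rfloor = 2$ cheap tickets.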



\begin{figure*}[htbp]
\centering
    \subfigure{}{
        \includegraphics[width=1\textwidth]{images/Average_methods.pdf}
    }

    \subfigure{}{
        \includegraphics[width=0.8\textwidth]{images/legend.pdf}
    }

\caption{Comparison of various averaging methods. We combine CIA, CAA, and CIMA with RigL and report the test accuracy of the ultimate tickets. For CIMA, we vary the exponential decay rate $\beta \in \{0.9, 0.8, 0.5, 0.2, 0.1\}$.}
    \label{Average_methods}

\end{figure*}

 
\subsubsection{Superposing Tickets}\label{S_tickets}


Superposing multiple sparse networks is more complex than superposing multiple dense networks~\citep{cheung2019superposition,izmailov2018averaging}. Naively keeping every weight that is activated in any of the cheap tickets will significantly increase the parameter count, as different subnetworks have different connectivities. To address this problem, we propose to perform \underline{weight averaging followed by weight pruning}. More concretely, assuming we collect $\mathrm{M}$ cheap tickets $\{\bm{\mathrm{\theta}}_\mathrm{s}^1, \bm{\mathrm{\theta}}_\mathrm{s}^2, \ldots , \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{M}\}$ at the end of training, we consider the following three ways to average them.



\textbf{Connection Independent Averaging (CIA).} The ultimate subnetwork averaged by CIA is given as: $\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s^{\prime}} = \frac{1}{\mathrm{M}} \sum_{\mathrm{i=1}}^{\mathrm{M}} \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{i}$, where $\mathrm{M}$ is the total number of cheap tickets. CIA simply averages weights across all the cheap tickets without considering whether the connection is activated or not in each cheap ticket. CIA tends to preserve the connections that are activated in the majority of the cheap tickets whereas the ones that are occasionally activated in one or two cheap tickets are likely to have small magnitude after averaging by $\mathrm{M}$, unless they have extremely large values.

\textbf{Connection Aware Averaging (CAA).} The ultimate subnetwork averaged by CAA is given, for each connection, as: $\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}\mathrm{(k, j)} = \frac{1}{\mathrm{N(k, j)}} \sum_{\mathrm{i=1}}^{\mathrm{M}} \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{i}\mathrm{(k, j)}$, where $\mathrm{N(k, j)}$ is the number of cheap tickets in which the connection $\mathrm{\theta(k, j)}$ is activated; $k$ indexes the neuron in the previous layer and $j$ indexes the neuron in the current layer. Thus, we have $\mathrm{N(k, j)} \leq \mathrm{M}$. Compared with CIA, CAA pays more attention to the occasionally activated connections that exist in only a minority of the cheap tickets.
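
As a small illustration of the difference between the two rules (the numbers are purely illustrative and are not experimental results), consider $\mathrm{M}=3$ cheap tickets and a single connection $\mathrm{(k, j)}$ that is activated only in the first ticket with weight $0.9$, so that $\mathrm{N(k, j)}=1$:
\begin{align*}
\text{CIA:}\quad & \tfrac{1}{3}\,(0.9 + 0 + 0) = 0.3, \\
\text{CAA:}\quad & \tfrac{1}{1}\,(0.9 + 0 + 0) = 0.9.
\end{align*}
A second connection that is activated in all three tickets with weights $0.5$, $0.6$, and $0.7$ receives the value $0.6$ under both rules. After averaging, the rarely activated connection therefore ranks below the consistently activated one under CIA ($0.3 < 0.6$) but above it under CAA ($0.9 > 0.6$), which determines which of the two is more likely to survive magnitude pruning.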


\textbf{Connection Independent Moving Averaging (CIMA).}
Motivated by the widely used moving average technique~\citep{kingma2014adam,karras2017progressive}, we sequentially apply an exponential moving average over the cheap tickets obtained at each cycle. The averaged subnetwork over the first $t$ cheap tickets is given as: $\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t} = \beta \widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t-1} + (1-\beta) \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{t}$, where $\beta$ controls the exponential decay rate. A larger $\beta$ places more weight on the cheap tickets collected earlier.
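
To make the role of $\beta$ explicit, the recursion can be unrolled. The sketch below assumes the initialization $\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{1} = \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{1}$ (an assumption made here for illustration) and ignores any pruning between steps:
\begin{equation*}
\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t} = \beta^{t-1}\, \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{1} + (1-\beta) \sum_{i=2}^{t} \beta^{t-i}\, \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{i}.
\end{equation*}
For instance, with $t=3$ the coefficients on the first, second, and third cheap tickets are $(0.64,\ 0.16,\ 0.20)$ for $\beta=0.8$ and $(0.04,\ 0.16,\ 0.80)$ for $\beta=0.2$; in both cases they sum to one.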

Note that the averaged subnetwork is likely to be denser than the target sparsity level allows, since different cheap tickets activate different connections. To maintain the same sparsity as the original subnetwork, we apply magnitude weight pruning to remove the weights with the smallest magnitude after every averaging step.
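
A toy example of averaging followed by magnitude pruning is given below. The values are hypothetical: a layer with four connections, a budget of two non-zero weights, and CIA with $\mathrm{M}=2$ cheap tickets.
\begin{align*}
\bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{1} &= (0.8,\ 0.4,\ 0,\ 0), \\
\bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{2} &= (0.6,\ 0,\ -0.5,\ 0), \\
\tfrac{1}{2}\bigl(\bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{1} + \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{2}\bigr) &= (0.7,\ 0.2,\ -0.25,\ 0), \\
\text{after pruning} &= (0.7,\ 0,\ -0.25,\ 0).
\end{align*}
The average has three non-zero weights, one more than either cheap ticket; removing the entry with the smallest magnitude ($0.2$) restores the budget of two.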


\subsection{Memory and Computation Overhead} 


Instead of saving $\mathrm{M}$ cheap tickets and averaging them afterwards, we apply an operation similar to CIMA to avoid the extra memory that CIA and CAA would otherwise require during training. The averaged subnetwork over the first $t$ cheap tickets is given as:
\begin{equation}
\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t} = \frac{(t-1) \cdot \widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t-1} + \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{t}}{t}
\label{eq:average}
\end{equation}
This operation allows us to accomplish the average operation by maintaining only one extra copy of the averaged weights, instead of saving $\mathrm{M}$ subnetworks.
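
That Eq.~\ref{eq:average} reproduces the CIA average can be checked by induction, under the simplifying assumption (made here for the sake of the argument) that no pruning is applied between updates. The base case $t=1$ gives $\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{1} = \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{1}$. If $\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t-1} = \frac{1}{t-1}\sum_{i=1}^{t-1} \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{i}$, then
\begin{equation*}
\widetilde{\bm{\mathrm{\theta}}}_\mathrm{s}^\mathrm{t}
= \frac{(t-1) \cdot \frac{1}{t-1}\sum_{i=1}^{t-1} \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{i} + \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{t}}{t}
= \frac{1}{t} \sum_{i=1}^{t} \bm{\mathrm{\theta}}_\mathrm{s}^\mathrm{i},
\end{equation*}
so that at $t=\mathrm{M}$ the running average coincides with the CIA average. When magnitude pruning is applied after each averaging step, this identity need not hold exactly.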


Moreover, as mentioned above, we use the final 10\% of the training time to create cheap tickets, and thus the training time of Sup-tickets is the same as that of standard sparse training. Since we only need to evaluate Eq.~\ref{eq:average} $\mathrm{M}-1$ times, the extra computation cost of averaging is negligible compared with the total training cost. Overall, the training cost of Sup-tickets is approximately the same as that of training a single sparse network.



\section{Experiments}



Sup-tickets is a general idea that can be straightforwardly applied to any type of sparse training method. To verify its effectiveness, we apply it to various sparse training methods, including three DST methods: SET~\citep{mocanu2018scalable}, RigL~\citep{evci2020rigging}, and GraNet~\citep{liu2021sparse}; one SST method: ERK~\citep{evci2020rigging}; and one pruning-at-initialization approach: SNIP~\citep{lee2018snip}.


\begin{table*}[htbp]
\centering
\caption{Test accuracy (\%) of sparse VGG-16 on CIFAR-10/100. All the results are averaged from three random runs. In each setting, the best results are marked in bold.}

\label{table:VGG16}
\resizebox{1.0\textwidth}{!}{
\begin{tabular}{lccc ccc}
\cmidrule[\heavyrulewidth](lr){1-7}

 \textbf{Dataset}     & \multicolumn{3}{c}{CIFAR-10} & \multicolumn{3}{c}{CIFAR-100}  \\ 
 \cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
 
 \textbf{VGG-16}~(Dense) 
& 93.91$\pm$0.26  & - & - 
& 73.61$\pm$0.45  & - & - 
\\
 \cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}

Sparsity     & 95\%      & 90\%     & 80\%     
     &  95\%      & 90\%     & 80\%         \\ 
     
 \cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}

SET~\citep{mocanu2018scalable}
& 92.96$\pm$0.18 & 93.54$\pm$0.23   & 93.56$\pm$0.04
& 70.10$\pm$0.33 & 71.50$\pm$0.23  & 72.38$\pm$0.08

\\
SET+Sup-tickets (ours)
&\textbf{93.22$\pm$0.09}  & \textbf{93.63$\pm$0.05} & \textbf{93.80$\pm$0.13} &
\textbf{71.18$\pm$0.29}  & \textbf{71.99$\pm$0.27} & \textbf{73.02$\pm$0.32}
\\
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
RigL~\citep{evci2020rigging}
& 92.70$\pm$0.08 & 93.48$\pm$0.16   &93.60$\pm$0.14
& 70.65$\pm$0.16   & 72.20$\pm$0.09  & 72.63$\pm$0.23 
\\
RigL+Sup-tickets (ours)
&\textbf{93.20$\pm$0.13}  &  \textbf{93.81$\pm$0.11}  & \textbf{93.85$\pm$0.25}
&\textbf{71.31$\pm$0.21} & \textbf{72.57$\pm$0.29} &  {\textbf{73.61$\pm$0.11}}
\\
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
GraNet \citep{liu2021sparse}  
& 93.87$\pm$0.19 & 93.83$\pm$0.30 & 93.77$\pm$0.18
&72.91$\pm$0.39 & 73.48$\pm$0.17 & 73.36$\pm$0.14

\\
GraNet+Sup-tickets (ours)
& {\textbf{94.10$\pm$0.06}} &  {\textbf{94.13$\pm$0.12}} &  {\textbf{94.24$\pm$0.05}}
&  {\textbf{73.61$\pm$0.24}}&  {\textbf{73.87$\pm$0.26}} &  {\textbf{73.95$\pm$0.30}}

\\
\cmidrule[\heavyrulewidth](lr){1-7}

\end{tabular}}
\end{table*}




\begin{table*}[htbp]
\centering
\caption{Test accuracy (\%) of sparse ResNet-50 on CIFAR-10/100. All the results are averaged from three runs. In each setting, the best results are marked in bold.}
\label{table:RN50_CIFAR}
\resizebox{1.0\textwidth}{!}{
\begin{tabular}{lccc ccc}
\cmidrule[\heavyrulewidth](lr){1-7}

 \textbf{Dataset}     & \multicolumn{3}{c}{CIFAR-10} & \multicolumn{3}{c}{CIFAR-100}  \\ 
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
\textbf{ResNet-50}~(Dense) 
& 94.88$\pm$0.11 & - & -  
& 78.00$\pm$0.40  & - & - 
\\
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}

Sparsity     & 95\%      & 90\%     & 80\%     
     &  95\%      & 90\%     & 80\%         \\ 
     
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}


SNIP~\citep{lee2018snip} 
& 94.01$\pm$0.28 & 94.81$\pm$0.36 & 94.91$\pm$0.16
& 41.25$\pm$1.10  & 68.79$\pm$1.16 & 75.29$\pm$1.28
\\

SNIP+Sup-tickets (ours)
& \textbf{94.33$\pm$0.09} & \textbf{95.05$\pm$0.22} & \textbf{95.21$\pm$0.09}
& \textbf{65.56$\pm$1.15} & \textbf{76.34$\pm$0.27} & \textbf{77.43$\pm$0.53}

\\
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}

ERK~\citep{evci2020rigging} 
& 93.44$\pm$0.22 & 94.41$\pm$0.13 & 94.85$\pm$0.21
& 74.49$\pm$0.30 &  76.36$\pm$0.22 & 77.41$\pm$0.08
\\

ERK+Sup-tickets (ours)
& \textbf{93.92$\pm$0.04} & \textbf{94.80$\pm$0.06} & \textbf{95.11$\pm$0.27}
& \textbf{75.75$\pm$0.28} & \textbf{76.82$\pm$0.08} & \textbf{77.85$\pm$0.42}
\\
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}

SET~\citep{mocanu2018scalable}
&94.49$\pm$0.11   & 94.73$\pm$0.27   & 94.74$\pm$0.17
& 76.59$\pm$0.54  & 77.79$\pm$0.27   & \textbf{78.45$\pm$0.50}  
\\

SET+Sup-tickets (ours)
&\textbf{94.81$\pm$0.05}   & \textbf{94.87$\pm$0.03}  & \textbf{94.90$\pm$0.27}
&\textbf{76.68$\pm$0.38}  & \textbf{77.89$\pm$0.45}   &  {78.35$\pm$0.18}
\\

\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
RigL~\citep{evci2020rigging}
& 94.59$\pm$0.19 & 94.70$\pm$0.17   & 94.70$\pm$0.07
& 76.96$\pm$0.39 & 77.95$\pm$0.36   & {78.19$\pm$0.51}
\\
RigL+Sup-tickets (ours)
& \textbf{94.65$\pm$0.11}  & \textbf{94.82$\pm$0.13} &  \textbf{94.81$\pm$0.15}
&\textbf{77.58$\pm$0.47}  &  {\textbf{78.52$\pm$0.39}}  &    {\textbf{78.69$\pm$0.30}}
\\
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}

GraNet \citep{liu2021sparse}  
&94.70$\pm$0.23 &  94.95$\pm$0.09 & 94.86$\pm$0.24
&77.47$\pm$0.22 &  {78.25$ \pm$0.51} & {78.80$\pm$0.46}
\\
GraNet+Sup-tickets (ours)
& {\textbf{94.89$\pm$0.15}} &  {\textbf{95.08$\pm$0.08}} &  {\textbf{94.94$\pm$0.03}}
&\textbf{77.70$\pm$0.47} &  {\textbf{78.37$\pm$0.53}} &  {\textbf{78.95$\pm$0.33}}
\\

\cmidrule[\heavyrulewidth](lr){1-7}
%\bottomrule
\end{tabular}}
\end{table*}



\begin{table*}[htbp]
\centering
\caption{Test accuracy (\%) of sparse ResNet-50 on ImageNet. The training FLOPs of sparse training methods are normalized by the FLOPs used to train a dense model. In each setting, the best results are marked in bold.}

\label{table:classification_ImageNet}
\resizebox{1.0\textwidth}{!}{
\begin{tabular}{l  ccc  ccc}
% \cmidrule[\heavyrulewidth](lr){1-7}
\cmidrule[\heavyrulewidth](lr){1-7}
\textbf{Method} & Top-1 & FLOPs & FLOPs & Top-1 & FLOPs & FLOPs \\
& Accuracy & (Train) & (Test) & Accuracy & (Train) & (Test)
\\
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
%  \cmidrule[\heavyrulewidth](lr){1-7}
\textbf{ResNet-50}~(Dense)  & 76.8$\pm$0.09 & 1$\times$ (3.2e18) & 1$\times$ (8.2e9) & 76.8$\pm$0.09 & 1$\times$ (3.2e18) & 1$\times$ (8.2e9)
\\
%  \cmidrule[\heavyrulewidth](lr){1-7}
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
Sparsity & \multicolumn{3}{c}{80\%} & \multicolumn{3}{c}{90\%}
\\
%  \cmidrule[\heavyrulewidth](lr){1-7}
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
Static sparse training (ERK) & 72.1$\pm$0.04 & 0.42$\times$ & 0.42$\times$ & 67.7$\pm$0.12 & 0.24$\times$ & 0.24$\times$ 
\\
Small-Dense & 72.1$\pm$0.06 & 0.23$\times$ & 0.23$\times$ & 67.2$\pm$0.12 & 0.10$\times$ & 0.10$\times$ 
\\
SNIP~\citep{lee2018snip} & 72.0$\pm$0.06 & 0.23$\times$ & 0.23$\times$ & 67.2$\pm$0.12 & 0.10$\times$ & 0.10$\times$ 
\\
% % GraSP & 72.06 & & & 68.14 & &  
% % \\
% % SynFlow & 72.18 & & & 68.26 & &
% % \\
% % FORCE & 70.96 & & & 69.78 & &
% % \\
% %  \cmidrule[\heavyrulewidth](lr){1-7}
% \cmidrule(lr){1-1}
% \cmidrule(lr){2-4}
% \cmidrule(lr){5-7}
SET~\citep{mocanu2018scalable} & 72.9$\pm${0.39} & 0.23$\times$ & 0.23$\times$ & 69.6$\pm${0.23} & 0.10$\times$ & 0.10$\times$ 
\\
DSR~\citep{mostafa2019parameter} & 73.3 & 0.40$\times$ & 0.40$\times$ & 71.6 & 0.30$\times$ & 0.30$\times$ 
\\
SNFS~\citep{dettmers2019sparse} & 75.2$\pm${0.11} & 0.61$\times$ & 0.42$\times$ & 72.9$\pm${0.06} & 0.50$\times$ & 0.24$\times$
\\
% \cdashlinelr{1-7}
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
RigL~\citep{evci2020rigging} & 75.1$\pm$0.05 & 0.42$\times$ & 0.42$\times$ & 73.0$\pm$0.04 & 0.25$\times$ & 0.24$\times$ 
\\
RigL+Sup-tickets (ours) & \textbf{76.0} & 0.42$\times$ & 0.42$\times$ & \textbf{74.0} & 0.25$\times$ & 0.24$\times$ 
\\
\cmidrule(lr){1-1}
\cmidrule(lr){2-4}
\cmidrule(lr){5-7}
GraNet~\citep{liu2021sparse} & 75.9 & 0.37$\times$  & 0.35$\times$
&  74.4 & 0.25$\times$ & 0.20$\times$
\\
GraNet+Sup-tickets (ours)  & {\textbf{76.2}} & 0.37$\times$  & 0.35$\times$
&  {\textbf{74.6}} & 0.25$\times$ & 0.20$\times$
\\
\cmidrule[\heavyrulewidth](lr){1-7}
\end{tabular}}

\end{table*}


\subsection{Experimental Setups}



The experiments are conducted across various architectures on three popular datasets: CIFAR-10, CIFAR-100, and ImageNet. For CIFAR-10/100, we choose VGG-16~\citep{simonyan2014very}, Wide ResNet28-10~\citep{zagoruyko2016wide}, and ResNet-50~\citep{he2016deep}. The models are trained for 250 epochs and optimized by momentum SGD with an initial learning rate of 0.1, which is decayed by a factor of 10 at one half and three quarters of training. The cycle length is set to 8 epochs, so that we obtain 3 cheap tickets in 24 epochs. The model used for ImageNet is ResNet-50, which is trained for 100 epochs and optimized by momentum SGD with an initial learning rate of 0.1 that is decayed by a factor of 10 at epochs 30, 60, and 85. The cycle length for ImageNet is 2 epochs, so we obtain 4 cheap tickets in the last 8 epochs. The implementation details are reported in Appendix~\ref{sec:implementation}.


\subsection{Comparisons among CIA, CAA, and CIMA}


We first compare CIA, CAA, and CIMA on CIFAR-100 and report the results in Figure~\ref{Average_methods}. CIA consistently outperforms the other two methods at various sparsity levels. CAA is the worst-performing method, especially at the extreme sparsity of 95\%. With a tuned $\beta=0.8$, CIMA approaches the performance achieved by CIA. The better performance of CIA over CAA indicates that the occasionally activated connections are likely unimportant. CIA pays more attention to the connections that exist in the majority of the cheap tickets, which helps eliminate the unimportant connections that are activated only occasionally. Given its consistently superior performance, we choose CIA as our averaging method in the following sections.



\subsection{Evaluation of Sup-tickets}
\label{sec:experiments sup-tickets}
\textbf{CIFAR-10/100.} In this section, we provide an experimental comparison of Sup-tickets with a variety of sparse training techniques. The results on CIFAR-10/100 with VGG-16 and ResNet-50 are shown in Tables~\ref{table:VGG16} and~\ref{table:RN50_CIFAR}, respectively, and the results for Wide ResNet28-10 are deferred to Appendix~\ref{sec:WRN2810} due to limited space. Overall, our approach benefits sparse training across all studied architectures. Simple as it looks, Sup-tickets improves the performance of various dynamic sparse training methods in 63 out of 66 cases. Sup-tickets appears to perform better with VGG-16 than with the other two architectures, with accuracy gains of up to 0.5\% and 1.08\% on CIFAR-10 and CIFAR-100, respectively. We also find that the improvement on CIFAR-100 is larger than that on CIFAR-10, which makes sense since CIFAR-100 is less saturated and thus leaves more room for improvement. More importantly, when combined with the state-of-the-art DST method GraNet, our approach outperforms the dense networks with at most 10\% of the parameters on all architectures, and in several cases with only about 5\%, as reported in Table~\ref{table:GraNet Dense}. All these results highlight that Sup-tickets is a strong and universal performance booster for sparse training.




\begin{table}[!h]
    \centering
    \caption{{\small Performance comparison between GraNet+Sup-tickets and the dense network. Results that are at least as good as those of the corresponding dense network are marked in bold. WRN28-10 refers to Wide ResNet28-10. GraNet+Sup-tickets outperforms the dense network in most cases.}}

    \label{table:GraNet Dense}
    \resizebox{0.48\textwidth}{!}{
    \begin{tabular}{lccccc}
        \hline
        \multirow{2}{*}{Dataset}  &  \multirow{2}{*}{Network} & \multirow{2}{*}{Dense}  & \multicolumn{3}{c}{ GraNet+Sup-tickets} \\
        \cmidrule(lr){4-6}
        & &   &   95\% sparsity & 90\% sparsity & 80\% sparsity   \\
        \cmidrule(lr){1-1}
        \cmidrule(lr){2-2}
        \cmidrule(lr){3-3}
        \cmidrule(lr){4-6}
        
        \multirow{3}{*}{CIFAR-10}  & VGG-16 & 93.91$\pm$0.26  &\textbf{{94.10$\pm$0.06}} & \textbf{{94.13$\pm$0.12}} & \textbf{{94.24$\pm$0.05}}\\ 
        & ResNet-50 & 94.88$\pm$0.11 &\textbf{{94.89$\pm$0.15}} & \textbf{{95.08$\pm$0.08}} & \textbf{{94.94$\pm$0.03}} \\
        & WRN28-10 &  96.00$\pm$0.13 & \textbf{{96.03$\pm$0.11}} & \textbf{{96.13$\pm$0.07}} & \textbf{96.08$\pm$0.04} \\


        \cmidrule(lr){1-1}
        \cmidrule(lr){2-2}
        \cmidrule(lr){3-3}
        \cmidrule(lr){4-6}
  
       \multirow{3}{*}{CIFAR-100} & VGG-16 & 73.61$\pm$0.45  & \textbf{{73.61$\pm$0.24}}& \textbf{{73.87$\pm$0.26}} & \textbf{{73.95$\pm$0.30}}\\
        & ResNet-50 & 78.00$\pm$0.40  &{77.70$\pm$0.47} & \textbf{{78.37$\pm$0.53}} & \textbf{{78.95$\pm$0.33}}\\ 
        & WRN28-10 & 81.09$\pm$0.19  &{80.65$\pm$0.06} & \textbf{{81.20$\pm$0.09}} & \textbf{{81.42$\pm$0.18}} \\ 
        
        \bottomrule
    \end{tabular}}


\end{table}

\textbf{ImageNet.} For ImageNet, we apply Sup-tickets to RigL and GraNet and compare them with the existing sparse training methods. The results are reported in Table~\ref{table:classification_ImageNet}. Again, we improve the performance of GraNet and RigL at both 80\% and 90\% sparsity without an extra parameter budget. In particular, for RigL our approach improves the test accuracy by 0.9\% and 1.0\% at 80\% and 90\% sparsity, respectively. In addition, we compare Sup-tickets with the naive deep ensemble method and show the results in Appendix~\ref{sec:deep ensemble}.

Examining the results, we note that Sup-tickets improves both SST and DST methods in all settings with only a small modification to those algorithms. In all settings, it also outperforms a large array of other techniques.



\section{Extensive Analysis}






\textbf{Cyclical Length.} Here, we study how the cyclical length $C$ affects the performance of Sup-tickets. For all experiments, we still use the last 10\% of the training time to generate the cheap tickets, while varying the cyclical length among 2, 4, 8, and 12 epochs. The cheap ticket count then varies accordingly. The results are shown in Table~\ref{table:cyc_lenth}. In general, the intermediate lengths (i.e., $C=4$ or $C=8$) tend to achieve better accuracy than the extremely small or large lengths (i.e., $C=2$ or $C=12$). This is expected, since small lengths cannot guarantee the high quality (high accuracy) of each cheap ticket, whereas large lengths naturally decrease the number of collected tickets. Consequently, we use $C=8$ as the default setting in the main experiments of Section~\ref{sec:experiments sup-tickets}.

\begin{table}[htbp]
\centering
\caption{Test accuracy (\%) on CIFAR-100 of Sup-tickets combined with RigL under different cyclical lengths. The best results are marked in bold. }

\label{table:cyc_lenth}
\resizebox{0.4\textwidth}{!}{
\begin{tabular}{lccc}
\cmidrule[\heavyrulewidth](lr){1-4}

{\textbf{Cyclical}}  & \multicolumn{3}{c}{Sparsity} \\   
\cmidrule(lr){2-4}

{\textbf{length (epochs)}} & 95\%      & 90\%     & 80\%         
 \\ 
\cmidrule[\heavyrulewidth](lr){1-4}

\multicolumn{4}{c}{VGG-16} 
\\
\cmidrule[\heavyrulewidth](lr){1-4}
C=2
& 71.35$\pm$0.14 & 72.89$\pm$0.41 & \textbf{73.65$\pm$0.20}
\\
C=4
&\textbf{71.42$\pm$0.19} &\textbf{73.00$\pm$0.20} & 73.62$\pm$0.40
\\
C=8
& 71.31$\pm$0.21 & 72.57$\pm$0.29 &73.61$\pm$0.11
\\
C=12
& 71.27$\pm$0.06 & 72.69$\pm$0.43 &73.45$\pm$0.06
\\
\cmidrule[\heavyrulewidth](lr){1-4}
\multicolumn{4}{c}{ResNet-50} 
\\
\cmidrule[\heavyrulewidth](lr){1-4}
C=2
&\textbf{77.58$\pm$0.22} & 78.48$\pm$0.45 & 78.50$\pm$0.32
\\
C=4
&77.33$\pm$0.26 & \textbf{78.52$ \pm$0.36} & 78.62$\pm$0.34
\\
C=8
&\textbf{77.58$\pm$0.47} & \textbf{78.52$\pm$0.39} & \textbf{78.69$\pm$0.30}
\\
C=12
& 77.17$\pm$0.42 &  78.39$\pm$0.43 & 78.48$\pm$0.38
\\
\cmidrule[\heavyrulewidth](lr){1-4}
\end{tabular}
}

\end{table}


\looseness=-1 \textbf{Number of Cheap Tickets.} To study the effect of the cheap ticket count on the ultimate ticket's performance, we vary the cheap ticket count among 2, 4, and 7 while fixing the cyclical length at 8 epochs. The overall training time is set to 250 epochs. Under this setting, the time used for ticket generation is no longer fixed at 10\% but changes with the cheap ticket count. We report the results in Figure~\ref{fig:different_tickets_number}-left. Our approach achieves the best performance with four tickets, which is neither the largest nor the smallest ticket count. A likely reason is that creating too many cheap tickets shortens the normal sparse training phase and thus yields cheap tickets with poor performance; Figure~\ref{fig:different_tickets_number}-right supports this explanation. On the other hand, 2 cheap tickets are too few to boost the performance. Figure~\ref{fig:different_tickets_number} also illustrates the effectiveness of Sup-tickets, as the superposed subnetworks outperform the individual subnetworks by a large margin.
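
The trade-off can be made concrete with a short calculation, assuming that each cheap ticket costs exactly one 8-epoch cycle (an assumption made here for illustration; the precise accounting depends on the implementation). For 2, 4, and 7 cheap tickets, the ticket generation phase then takes
\begin{equation*}
2 \times 8 = 16, \qquad 4 \times 8 = 32, \qquad 7 \times 8 = 56
\end{equation*}
epochs, i.e., $6.4\%$, $12.8\%$, and $22.4\%$ of the 250 training epochs, which leaves 234, 218, and 194 epochs, respectively, for the normal sparse training phase.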

\begin{figure}[h]

\centering

    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/tickets_num_supticket.pdf}
    }
    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/tickets_num_mean.pdf}
    }

    \subfigure{}{
        \includegraphics[width=0.5\textwidth]{images/tickets_num_legend.pdf}}
\caption{Impact of the cheap ticket count. Experiments are conducted with Wide ResNet28-10 trained with RigL+Sup-tickets on CIFAR-100. \textbf{Left:} test accuracy of the ultimate tickets. \textbf{Right:} mean accuracy of the individual cheap tickets used to build the ultimate tickets.}

\label{fig:different_tickets_number}

\end{figure}


\looseness=-1 {The fixed training time constraint is important for enabling fair comparisons among various sparse training methods, since training efficiency is one of the main benefits of sparse training. It is nevertheless natural to ask whether Sup-tickets keeps improving when this constraint is removed. To evaluate this, we simply extend the overall training time to yield more cheap tickets. The results are reported in Appendix~\ref{sec:ticket_count}. We can see that the performance of Sup-tickets continuously improves as the number of tickets increases.}



{\textbf{Diversity Analysis.} We report the diversity of the subnetworks obtained during training using KL divergence and prediction disagreement, which are widely used for deep ensembling~\citep{liu2021deep,fort2019deep}. We compare our method against the traditional dense ensemble and two state-of-the-art efficient ensemble methods, TreeNet~\citep{lee2015m} and BatchEnsemble~\citep{wen2020batchensemble}, with Wide ResNet28-10 on CIFAR-10 (Table~\ref{tab:diversity_cheaptickets}). The results are in line with our intuition. We observe that the diversity of the cheap tickets obtained by our method is lower than that of the traditional dense ensemble. This makes sense, since the networks of the traditional dense ensemble are obtained from different runs and should converge to different basins, whereas the cheap tickets obtained by our method are intended to be located in the same basin and hence have relatively lower diversity. Nevertheless, our method still maintains a diversity similar to or even higher than that of TreeNet and BatchEnsemble. The relatively low diversity is consistent with our cheap tickets being located in the same wide and flat low-loss region, which is crucial for the success of weight averaging, since averaging overly diverse networks can lead to very poor performance, as shown in previous work~\citep{izmailov2018averaging,wortsman2021learning}. }
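
For reference, these two metrics are commonly defined as follows; the notation is introduced here for illustration, and the exact variant used in the evaluation may differ. For two networks with predictive distributions $p_a(y \mid x)$ and $p_b(y \mid x)$ and a test set $\{x_n\}_{n=1}^{N}$,
\begin{align*}
d_{\text{dis}} &= \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\Bigl[\argmax_{y}\, p_a(y \mid x_n) \neq \argmax_{y}\, p_b(y \mid x_n)\Bigr], \\
d_{\mathrm{KL}} &= \frac{1}{N} \sum_{n=1}^{N} \sum_{y} p_a(y \mid x_n) \log \frac{p_a(y \mid x_n)}{p_b(y \mid x_n)},
\end{align*}
typically averaged over pairs of ensemble members. Larger values of either quantity indicate more diverse predictions.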




\begin{table}[!ht]
\centering
\tiny
\caption{{Prediction disagreement and KL divergence among various  ensemble methods.}}

\label{tab:diversity_cheaptickets}


\resizebox{0.45\textwidth}{!}{
\begin{tabular}{lcc}
\toprule
Methods &  {$d_{\text{dis}}$ ($\uparrow$)}  & {$d_{\mathrm{KL}}$  ($\uparrow$)} \\
\midrule
TreeNet~\citep{lee2015m}                    & 0.010 & 0.010 \\ 
BatchEnsemble~\citep{wen2020batchensemble}              & 0.014 & 0.020 \\ 
% EDST Ensemble              & 0.031 & 0.073 \\
% MIMO                       & 0.032 & 0.086 \\
\cmidrule(lr){1-3}
SET+Sup-tickets  (ours)          & 0.015 & 0.015 \\
RigL+Sup-tickets  (ours)           & 0.017 & 0.015 \\
% GraNet+Sup-tickets (ours)           & 0.037 & 0.022 \\
\cmidrule(lr){1-3}
Traditional Dense Ensemble             & 0.032 & 0.086 \\

\bottomrule
\end{tabular} }
\end{table}





\begin{figure*}[htbp]
\centering
    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/ECE_VGG-16_CIFAR-10.pdf}}
    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/ECE_ResNet-50_CIFAR-10.pdf}}
    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/ECE_VGG-16_CIFAR-100.pdf}}
    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/ECE_ResNet-50_CIFAR-100.pdf}}
        

    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/NLL_VGG-16_CIFAR-10.pdf}}
    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/NLL_ResNet-50_CIFAR-10.pdf}}
    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/NLL_VGG-16_CIFAR-100.pdf}}
    \subfigure{}{
        \includegraphics[width=0.23\textwidth]{images/NLL_ResNet-50_CIFAR-100.pdf}}

     \subfigure{}{
        \includegraphics[width=0.45\textwidth]{images/ECE_NLL_legend.pdf}}       

\caption{Comparison between RigL and RigL+Sup-tickets in terms of ECE and NLL.}
    \label{fig:uncertainty}

\end{figure*}


{\textbf{Comparison with Different Learning Rate Schedules.}
We compare our method with two learning rate schedule baselines: the learning rate schedule used in FGE~\citep{garipov2018loss} and the learning rate schedule used in SWA~\citep{izmailov2018averaging}. In all schedules, Sup-tickets are collected at the lowest learning rate stage, and we fix the learning rate range of these schedules for a fair comparison. The results on CIFAR-100 are reported in Table~\ref{table:lr_s}. All the results are averaged over 3 random runs. It can be seen that our method surpasses the other baselines in 5 out of 6 cases. }



\begin{table}[htbp]
\centering
\caption{{Test accuracy (\%) on CIFAR-100 under different learning rate (LR) schedules. The best results are marked in bold.}}

\label{table:lr_s}
\resizebox{0.4\textwidth}{!}{
\begin{tabular}{lccc}
\cmidrule[\heavyrulewidth](lr){1-4}

{\textbf{LR schedule}}  & \multicolumn{3}{c}{Sparsity } \\   
\cmidrule(lr){2-4}

{\textbf{Method}} & 95\%      & 90\%     & 80\%         
 \\ 
\cmidrule[\heavyrulewidth](lr){1-4}

\multicolumn{4}{c}{VGG-16} 
\\
\cmidrule[\heavyrulewidth](lr){1-4}
LR of FGE~\citep{garipov2018loss}
& 70.66$ \pm$0.25  & 72.47$\pm$0.44 & 73.22$\pm$0.23
\\
LR of SWA~\citep{izmailov2018averaging}
& 71.26$\pm$0.16& \textbf{72.77$\pm$0.37} & {73.44$\pm$0.19}
\\
Sup-tickets (ours)
&\textbf{71.31$\pm$0.21} & {72.57$\pm$0.29} &  {\textbf{73.61$\pm$0.11}}
\\
\cmidrule[\heavyrulewidth](lr){1-4}
\multicolumn{4}{c}{ResNet-50} 
\\
\cmidrule[\heavyrulewidth](lr){1-4}
LR of FGE~\citep{garipov2018loss}
&  77.30$\pm$0.67  & 78.20$\pm$0.53 &  78.35$\pm$0.35
\\
LR of SWA~\citep{izmailov2018averaging}
& 77.30$\pm$0.36 & 78.39$\pm$0.38 & 78.48$\pm$0.35
\\
Sup-tickets (ours)
&\textbf{77.58$\pm$0.47}  &  {\textbf{78.52$\pm$0.39}}  &    {\textbf{78.69$\pm$0.30}}
\\
\cmidrule[\heavyrulewidth](lr){1-4}
\end{tabular}
}

\end{table}


We adjust the learning rate schedule slightly so that, in each cycle, the learning rate gradually rises to an increased but still small value (0.005) and then decays to the lowest value (0.001). Such a smooth schedule is intended to keep the new cheap tickets within the same basin instead of jumping out of it. To verify this, we conduct additional experiments and report the results in Appendix~\ref{sec:large lr}.







\textbf{Batch Normalization.} When there are batch normalization (BN) layers~\citep{ioffe2015batch} in the model, traditional weight averaging approaches~\citep{garipov2018loss,izmailov2018averaging} usually run one additional pass over the data to calculate the mean and standard deviation of these layers. In contrast, we obtain these statistics by simply averaging the mean and standard deviation of the BN layers across all cheap tickets, without an extra forward pass. To avoid extra memory usage, similar to the weight averaging operation in Eq.~\ref{eq:average}, we calculate the superposed ticket's BN statistics $\widetilde{\bm{\mathrm{\theta}}}_\mathrm{bn}^\mathrm{t}$ across the first $t$ cheap tickets using $\frac{(t-1) \cdot \widetilde{\bm{\mathrm{\theta}}}_\mathrm{bn}^\mathrm{t-1} + \bm{\mathrm{\theta}}_\mathrm{bn}^\mathrm{t}}{t}$, where $\bm{\mathrm{\theta}}_\mathrm{bn}^\mathrm{t}$ denotes the mean and standard deviation of the BN layers of the $t$-th cheap ticket. The comparison of test accuracy under these two strategies is reported in Appendix~\ref{sec:bn}.








\textbf{Uncertainty Estimation.} In safety-critical scenarios such as self-driving and medical treatment, classifiers should not only be accurate but also indicate when they are likely to be incorrect~\citep{guo2017calibration}. We therefore further evaluate the performance of our approach on uncertainty estimation. We choose two widely used metrics, expected calibration error (ECE)~\citep{guo2017calibration} and negative log-likelihood (NLL)~\citep{quinonero2005evaluating}, to compare the uncertainty estimates of different methods. We apply Sup-tickets to RigL and compare it with vanilla RigL in Figure~\ref{fig:uncertainty}. In addition to improving accuracy, Sup-tickets also achieves better uncertainty estimation than RigL, and we expect this improvement to generalize to other sparse training methods.
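
For reference, the standard definitions of the two metrics are sketched below; the notation and the number of bins $B$ are introduced here for illustration and are not settings reported in this paper. With the $N$ test predictions partitioned into $B$ confidence bins $S_1, \dots, S_B$,
\begin{align*}
\mathrm{ECE} &= \sum_{b=1}^{B} \frac{|S_b|}{N}\, \bigl|\mathrm{acc}(S_b) - \mathrm{conf}(S_b)\bigr|, \\
\mathrm{NLL} &= -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid x_n),
\end{align*}
where $\mathrm{acc}(S_b)$ is the fraction of correct predictions in bin $S_b$, $\mathrm{conf}(S_b)$ is the mean predicted confidence in that bin, and $p(y_n \mid x_n)$ is the probability assigned to the true label. Lower values are better for both metrics.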


\section{Conclusion}

In this paper, we presented a novel sparse training approach, Sup-tickets, which effectively produces many cheap subnetworks (tickets) during training and superposes them into one stronger ultimate subnetwork. Sup-tickets is easily combined with existing techniques, is agnostic to model architectures and datasets, and boosts sparse training performance with only a negligible amount of extra FLOPs. Across various scenarios, Sup-tickets obtains consistent improvements in accuracy as well as uncertainty estimation under the same training time as standard sparse training methods.
Notably, when combined with GraNet, Sup-tickets outperforms the corresponding dense networks on CIFAR-10/100 even in extremely sparse situations.


There are many potential directions to explore in the future. For example, even if Sup-tickets enables sparse neural networks to match or outperform their dense counterparts in terms of test accuracy, do they learn the same representations as their dense counterparts? In addition, we hope that the superior performance achieved by Sup-tickets will inspire more researchers to invest in developing hardware accelerators with better support for sparse training.






\clearpage
\newpage

\bibliography{yin_307}






\end{document}


