% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
    \bibliographystyle{plainnat}
    % \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% my packages
\usepackage{microtype}
\usepackage{graphicx}
% \usepackage{subfigure}
\usepackage{subcaption}
\usepackage{bbm}
% \usepackage{bm}
\usepackage{amsfonts}
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{cleveref}
\usepackage{natbib} % has a nice set of citation styles and commands
\usepackage{amsmath} % for aligned
\newtheorem{definition}{Definition}
\usepackage{multirow}
\usepackage{siunitx}
\usepackage{color}
\usepackage{colortbl}
\usepackage{xcolor}
\usepackage{array}
\DeclareMathOperator*{\argmin}{arg\,min}

% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
\usepackage{xr}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Quantifying lottery tickets under label noise:\\ accuracy, calibration, and complexity\\(Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<varora@sissa.it>?Subject=Your UAI 2023 paper}{Viplove Arora}{}}
\author[2]{Daniele Irto}
\author[1]{Sebastian Goldt}
\author[1]{Guido Sanguinetti}
% Add affiliations after the authors
\affil[1]{%
    Theoretical and Scientific Data Science\\
    SISSA\\
    Trieste, Italy
}
\affil[2]{%
    Data Science and Scientific Computing\\
    University of Trieste\\
    Trieste, Italy
}

\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
  \typeout{(#1)}
  \@addtofilelist{#1}
  \IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
    \externaldocument{#1}%
    \addFileDependency{#1.tex}%
    \addFileDependency{#1.aux}%
}
\myexternaldocument{arora_437}

\setcounter{table}{0}
\renewcommand{\thetable}{S\arabic{table}}%
\setcounter{figure}{0}
\renewcommand{\thefigure}{S\arabic{figure}}%
\setcounter{equation}{0}
\renewcommand{\theequation}{S\arabic{equation}}
\setcounter{section}{0}
\renewcommand{\thesection}{S\arabic{section}}
  
\begin{document}
  
\onecolumn %% Turn this off if single column is desired for the supplement
\maketitle
\appendix
\section{Mixture Classification Datasets}
\label{app:mixt_class}
A common technique in the study of neural network theory is to operate in a controlled setting, where models and data are simplified to allow analytical computations for otherwise intractable objects. An example of such an approach is the teacher-student framework \citep{seung1992statistical}, where the dataset is created by a neural network built for that purpose (teacher) and another neural network is tasked with learning on that data (student). For our analysis, we create two types of balanced binary classification datasets using Gaussian mixtures with different characteristics and varying levels of difficulty. Using the mixture model allowed us to create datasets that are easy to understand and can be easily visualised in a two-dimensional space. 

We consider two settings for mixture classification. The \emph{linear} dataset consists of two clusters that can be separated by a linear function. The two clusters have different means $\mu_1 \neq \mu_2$ but the same covariance $\Sigma_1 = \Sigma_2$. The \emph{XOR} dataset is constructed so that the resulting clusters are arranged like the graphical representation of the XOR logical function. We expect the difficulty of the classification task to increase going from the linear to the XOR dataset. Data are sampled from a Gaussian mixture model conditioned on a predefined label $y$. We also define a coefficient of separation $\nu$ to modulate the distance between clusters. 

To properly define the formulation of our mixture models, we use an approach similar to that used by \citet{refinetti2021classifying}. The data distribution for a single input sample $\mathbf{x} \in \mathbb{R}^D$ given class $y$ sampled uniformly at random is:

\begin{equation}
    p(\mathbf{x},y) = p(y) \; p(\mathbf{x}|y), \quad p(\mathbf{x}|y) = \sum_{\alpha \in \mathcal{S}^T(y)} \mathcal{P}_\alpha \, \mathcal{N}(\mathbf{\mu}_\alpha , \mathbf{\Sigma}_\alpha),
\end{equation}

where $p(\mathbf{x}|y)$ is the probability of sampling $\mathbf{x}$ conditioned on the class $y \in \{0, 1\}$ of the sample. Each $\mathbf{x}$ is sampled from a multivariate normal distribution. $\mathcal{S}^T(y)$ is the set of all possible indexes for class $y$ and dataset type $T$. $\mathcal{P}_\alpha$ is the probability of the $\alpha$-th $D$-dimensional multivariate normal distribution $\mathcal{N}(\mathbf{\mu}_\alpha , \mathbf{\Sigma}_\alpha)$, which depends on the size of $\mathcal{S}^T(y)$. This formulation can easily be used to generate data with an arbitrary number of clusters, classes, and dataset types. In particular, more classes and dataset types can be included by defining additional sets of indexes $\mathcal{S}^T(y)$. For our experiments, we only considered two classes. %Specific details on the three types of datasets can be seen in Section \ref{sec:data_gen}.
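
As an illustration, assume that the components of a class are equally weighted, $\mathcal{P}_\alpha = 1/|\mathcal{S}^T(y)|$ (an assumption that is consistent with the definition above but not spelled out by it). A class with two components, here labelled $\alpha^A$ and $\alpha^B$, then has the conditional density
\begin{equation*}
    p(\mathbf{x}|y) = \tfrac{1}{2} \, \mathcal{N}(\mathbf{\mu}_{\alpha^A} , \mathbf{\Sigma}_{\alpha^A}) + \tfrac{1}{2} \, \mathcal{N}(\mathbf{\mu}_{\alpha^B} , \mathbf{\Sigma}_{\alpha^B}),
\end{equation*}
while a class with a single component $\alpha$ reduces to $p(\mathbf{x}|y) = \mathcal{N}(\mathbf{\mu}_\alpha , \mathbf{\Sigma}_\alpha)$.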

\subsection{Data Generation Process}
\label{sec:data_gen}
% The goal of generating data using Gaussian mixture models is to observe how the test and train error curves of different neural networks subjected to iterative pruning change with the difficulty of the classification task. The mixture classification task provides a way to create artificial datasets and modulate their difficulty. This allows us to obtain data that could be arranged in clusters according to their corresponding classes. The difficulty of the classification task is inversely proportional to the distance between these clusters. That is, a task is more difficult the closer the clusters corresponding to the two classes are, and it is easier if the clusters are more easily separable. To better highlight the clusters in our data, we relied on Principal Component Analysis (PCA) to reduce the data to two dimensions and plot the resulting principal components in a 2D plot, where the clusters were easily representable and identifiable. The process that we are going to describe in the following sections can be used for generating both the training and the testing set. Note that the labels in the train set are corrupted to clearly produce the two double descent curves.

\paragraph{Labels:}
The first step in generating the data is to create a vector $\mathbf{y}$ of class labels of size equal to the desired number of training samples $N$. This is done by sampling $N$ values from a uniform distribution $\mathcal{U}(0,1)$ and mapping them to $0$ or $1$ depending on whether they are $\le 0.5$ or $> 0.5$, respectively. The resulting vector $\mathbf{y}$ contains the values $0$ and $1$ in approximately balanced proportions and provides the labels of the training samples in our dataset. 
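
The thresholding step can be sketched in a few lines of NumPy. This is a minimal illustration and not the code used for the experiments; the function name \texttt{sample\_labels}, the seed, and the sample size are hypothetical.

\begin{verbatim}
```python
import numpy as np


def sample_labels(n_samples, seed=0):
    """Draw u ~ U(0, 1) for every sample and threshold at 0.5."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=n_samples)
    # Values <= 0.5 map to class 0, values > 0.5 map to class 1.
    return (u > 0.5).astype(np.int64)


y = sample_labels(10_000)
print("fraction of class 1:", y.mean())
```
\end{verbatim}

Because each label is an independent fair coin flip, the two classes are balanced in expectation rather than exactly.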

\paragraph{Linear dataset:}
For each observation $\mathbf{x}_i$, $i=1,\dots,N$, of the linear dataset, the set of indexes $\mathcal{S}^L(y)$ has only one element for each class. This means that, for each class, the input points are sampled from a single multivariate normal distribution:

\begin{subequations}
\label{eq:linear_mixture_model}
\begin{gather}
\mathcal{S}^L(y=0) = \{\alpha_0\} \; \rightarrow \; \mathbf{\mu}_{\alpha_0} = 0 \cdot \mathbbm{1}^D
\\
\mathcal{S}^L(y=1) = \{\alpha_1\} \; \rightarrow \; \mathbf{\mu}_{\alpha_1} = \nu \cdot \mathbbm{1}^D
\\
\mathbf{\Sigma}_{\alpha_0} = \mathbf{\Sigma}_{\alpha_1} = I_D.
\end{gather}
\end{subequations}

$\mathbbm{1}^D$ is a $D$-dimensional vector of all ones that can be multiplied by a scalar, meaning that all its elements get multiplied by that scalar. $I_D$ is the $D \times D$ identity matrix, and its elements can be multiplied by a scalar number as well\footnote{This notation is used consistently in this section, to indicate the means and covariances of the normal distributions.}. The distance between the two clusters, which makes the two classes more or less discernible, can be changed by simply increasing or decreasing the value of the $\nu$ coefficient. Performing PCA on the linear dataset and plotting the first two principal components yields the clusters shown in \cref{fig:linear_ds}. In those plots, it is possible to observe a clear distinction between the clusters belonging to the two classes. We can also see the change in distance between the clusters as $\nu$ is changed.
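
A possible sampler for the linear dataset is sketched here, assuming labels are obtained by thresholding uniform samples as in the Labels paragraph. The function name \texttt{sample\_linear} and the values $N = 2000$ and $D = 20$ are hypothetical choices for illustration and are not taken from the experiments.

\begin{verbatim}
```python
import numpy as np


def sample_linear(n_samples, dim, nu, seed=0):
    """Sample (x, y) from the two-cluster mixture with identity covariance."""
    rng = np.random.default_rng(seed)
    y = (rng.uniform(size=n_samples) > 0.5).astype(np.int64)
    # Mean is 0 * 1^D for class 0 and nu * 1^D for class 1.
    means = nu * y[:, None] * np.ones((1, dim))
    # Identity covariance: independent standard normal noise per coordinate.
    x = means + rng.standard_normal((n_samples, dim))
    return x, y


x, y = sample_linear(n_samples=2000, dim=20, nu=1.0)
print(x.shape, x[y == 0].mean(), x[y == 1].mean())
```
\end{verbatim}

Under this construction the Euclidean distance between the two class means is $\nu \sqrt{D}$, so $\nu$ directly controls the separation of the clusters.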

% \paragraph{Quadratic dataset:}
% For each observation $\mathbf{x_i}$, $i = 1, \dots, N$ of the quadratic dataset, the cardinality of the set of indexes is $1$ for both classes, as well. The only difference from the previous dataset is that the covariance matrices are different:

% \begin{subequations}
% \label{eq:quadratic_mixture_model}
% \begin{gather}
% \mathcal{S}^Q(y=0) = \{\alpha_0\} \; \rightarrow \; \mathbf{\mu}_{\alpha_0} = 0 \cdot \mathbbm{1}^D
% \\
% \mathcal{S}^Q(y=1) = \{\alpha_1\} \; \rightarrow \; \mathbf{\mu}_{\alpha_1} = \nu \cdot \mathbbm{1}^D
% \\
% \mathbf{\Sigma}_{\alpha_0} = I_D
% , \quad
% \mathbf{\Sigma}_{\alpha_1} = 2^2 \odot I_D.
% \end{gather}
% \end{subequations}

% Since the observations belonging to class $y=1$ are sampled from a distribution with larger covariance, the corresponding samples are more scattered in the $D$-dimensional space and they appear as a more inflated cluster in the principal components plot, as can be seen in Figure \ref{fig:quadratic_ds}. Given their mutual placement, the clusters of the two classes are not optimally separable by a linear function but they require a quadratic one.

\paragraph{XOR dataset:}
For the second dataset, our goal was to make the data appear as four separate clusters placed like the visual representation of the XOR logical operator. In this case, the sets of indexes corresponding to classes $y=0$ and $y=1$ have two elements each. Thus, the data points of each class are sampled from two different distributions with equal probabilities. For class $y=0$, the two components have the same means as the two classes of the linear dataset described in \cref{eq:linear_mixture_model}. For class $y=1$, the mean vectors of the multivariate normal distributions consist of two $\frac{D}{2}$-dimensional halves with different values:

\begin{subequations}
\label{eq:xor_mixture_model}
\begin{gather}
\mathcal{S}^X(y=0) = \{\alpha_0^A, \alpha_0^B \} \; \rightarrow 
    \begin{cases}
     \mathbf{\mu}_{\alpha_0^A} = 0 \cdot \mathbbm{1}^D \\
     \mathbf{\mu}_{\alpha_0^B} = \nu \cdot \mathbbm{1}^D
    \end{cases}
\\
\mathcal{S}^X(y=1) = \{\alpha_1^A, \alpha_1^B \} \; \rightarrow 
    \begin{cases}
     \mathbf{\mu}_{\alpha_1^A} = [0 \cdot \mathbbm{1}^\frac{D}{2} \,, \, \nu \cdot \mathbbm{1}^\frac{D}{2}] \\
     \mathbf{\mu}_{\alpha_1^B} = [\nu \cdot \mathbbm{1}^\frac{D}{2} \,, \, 0 \cdot \mathbbm{1}^\frac{D}{2}]  
    \end{cases}
\\
\mathbf{\Sigma}_{\alpha_0^A} = \mathbf{\Sigma}_{\alpha_0^B} = \mathbf{\Sigma}_{\alpha_1^A} = \mathbf{\Sigma}_{\alpha_1^B} = I_D.
\end{gather}
\end{subequations}

The PC plots of this dataset are shown in \cref{fig:XOR_ds}, where we can see the positioning of the four clusters in a cross-like layout.
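
The XOR construction can be sketched in the same way. The code is an illustration under stated assumptions: the names \texttt{xor\_means} and \texttt{sample\_xor} and the values of $N$ and $D$ are hypothetical, $D$ is assumed to be even, and the two components of each class are chosen with probability $1/2$ each.

\begin{verbatim}
```python
import numpy as np


def xor_means(dim, nu):
    """Return an array of shape (2, 2, dim) indexed as means[class, component]."""
    half = dim // 2
    zeros, shifted = np.zeros(half), nu * np.ones(half)
    class0 = [np.zeros(dim), nu * np.ones(dim)]
    class1 = [np.concatenate([zeros, shifted]),
              np.concatenate([shifted, zeros])]
    return np.array([class0, class1])


def sample_xor(n_samples, dim, nu, seed=0):
    """Sample (x, y) from the four-cluster mixture with identity covariance."""
    if dim % 2 != 0:
        raise ValueError("dim must be even so the mean splits into two halves")
    rng = np.random.default_rng(seed)
    y = (rng.uniform(size=n_samples) > 0.5).astype(np.int64)
    # Pick one of the two components of each class with equal probability.
    component = rng.integers(0, 2, size=n_samples)
    mu = xor_means(dim, nu)[y, component]
    x = mu + rng.standard_normal((n_samples, dim))
    return x, y


x, y = sample_xor(n_samples=2000, dim=20, nu=1.0)
print(x.shape, np.bincount(y))
```
\end{verbatim}

No single hyperplane separates the two classes in this layout, which is why the XOR task is expected to be harder than the linear one.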

\section{Pruning and rewinding} \label{app:IMP}
In our experiments, training is always followed by a series of pruning iterations performed with iterative magnitude pruning (IMP) \citep{han2015learning, frankle2018lottery}, which proceeds as follows:

\begin{enumerate}
    \item Initialise the model with weights $\mathcal{W}_0$, obtaining a model function $f(x \, ; \, \mathcal{W}_0)$.
    \item Train the model for the desired number of epochs $j$, reaching a set of weights $\mathcal{W}_j$.
    \item Sort all the weights in $\mathcal{W}_j$ by their absolute value and select the lowest $p \%$, where $p$ is a number between 0 and 100.
    \item Prune the selected weights by applying a mask to the original model, obtaining $f(x \, ; \, \text{mask} \odot \mathcal{W}_j)$. 
    \item Reset the remaining parameters to their first initialisation values, obtaining $f(x \, ; \, \text{mask} \odot \mathcal{W}_0)$. 
    \item Iterate the process $n$ times, pruning $p\%$ of the remaining connections at each iteration.
\end{enumerate}

In certain cases, the larger models need to be pruned using the lottery ticket rewinding technique \citep{frankle2019stabilizing}. Rewinding modifies only step 5 of the procedure described above:

\begin{enumerate}
\item[5] Reset the remaining parameters to the values at iteration $k \ll j$ of the training loop, obtaining $f(x \, ; \, \text{mask} \odot \mathcal{W}_k)$
\end{enumerate}

where $k$ is a hyperparameter specifying the training iteration to which the remaining weights are rewound. This technique is also available in the \texttt{OpenLTH}\footnote{\url{https://github.com/facebookresearch/open_lth}} library.
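
The pruning loop, with and without rewinding, can be sketched on a flat weight vector. This is a simplified NumPy illustration and not the \texttt{OpenLTH} implementation: the names \texttt{magnitude\_mask} and \texttt{imp} are hypothetical, and the toy \texttt{train\_fn} stands in for actual training, which would run SGD while keeping the pruned weights at zero.

\begin{verbatim}
```python
import numpy as np


def magnitude_mask(weights, mask, prune_percent):
    """Drop the prune_percent% smallest-magnitude weights still unpruned."""
    alive = np.flatnonzero(mask)
    n_prune = int(round(alive.size * prune_percent / 100.0))
    order = np.argsort(np.abs(weights.ravel()[alive]))
    new_mask = mask.copy().ravel()
    new_mask[alive[order[:n_prune]]] = False
    return new_mask.reshape(mask.shape)


def imp(w_init, train_fn, prune_percent, n_rounds, w_rewind=None):
    """Iterative magnitude pruning; resets to W_0, or to W_k if given."""
    w_reset = w_init if w_rewind is None else w_rewind
    mask = np.ones(w_init.shape, dtype=bool)
    weights = w_init.copy()
    for _ in range(n_rounds):
        trained = train_fn(weights, mask)                    # step 2
        mask = magnitude_mask(trained, mask, prune_percent)  # steps 3 and 4
        weights = w_reset * mask                             # step 5
    return weights, mask


# Toy stand-in for training: it leaves the surviving weights unchanged.
w0 = np.random.default_rng(0).standard_normal(1000)
ticket, mask = imp(w0, lambda w, m: w * m, prune_percent=20, n_rounds=3)
print(mask.sum())  # 512 weights survive, i.e. 1000 * 0.8**3
```
\end{verbatim}

Since each round removes $p\%$ of the remaining weights, a fraction $(1 - p/100)^n$ of the weights survives after $n$ rounds, for example $0.8^3 = 0.512$ for $p = 20$ and $n = 3$.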

\section{Model hyperparameters}
\label{app:hyperparameters}
\subsection{Mixture classification}
We used two-layer fully connected networks without bias for our experiments on the two mixture classification datasets. Models with $P \in \{1, 3, 5, 8, 10, 15, 20, 25, 50, 100, 200, 500, 1000, 10000\}$ neurons in the hidden layer were used to produce the double descent curve. These models were subsequently pruned using IMP to produce the sparse double descent curves. Network weights are initialised using the Kaiming uniform distribution. The activation function chosen for this model is the Rectified Linear Unit (ReLU). Stochastic gradient descent with a learning rate of 0.1 was used as the optimiser. All models were trained for \num{1000} epochs using a batch size of \num{1024}. All models were sufficiently pruned to observe the sparse double descent curves. All experiments were replicated five times.

\paragraph{Linear datasets:}
We varied the distance between cluster means $\nu \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$ to see how it impacts the test error and the effective number of parameters. 5\% random label noise in the training set was used to observe the two double descent curves.

\paragraph{XOR datasets:}
We varied the distance between cluster means $\nu \in \{0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3\}$ to see how it impacts the test error and the effective number of parameters. Higher values of $\nu$ were needed to ensure sufficient distance between the clusters. Higher label noise of 25\% was needed to consistently observe the two double descent curves in the XOR datasets.
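
One way to inject label noise into binary training labels is sketched here. The exact corruption scheme is not specified in this section, so the code assumes that a randomly chosen fraction of the labels is flipped; the function name \texttt{add\_label\_noise} and the toy label vector are hypothetical.

\begin{verbatim}
```python
import numpy as np


def add_label_noise(y, noise_fraction, seed=0):
    """Flip the binary labels of a randomly chosen fraction of the samples."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(round(noise_fraction * y.size))
    flip_idx = rng.choice(y.size, size=n_flip, replace=False)
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]
    return y_noisy


y_clean = np.array([0, 1] * 500)
y_train = add_label_noise(y_clean, noise_fraction=0.05)
print((y_train != y_clean).sum())  # 50 of the 1000 labels are flipped
```
\end{verbatim}

An alternative convention resamples the selected labels uniformly at random, in which case only about half of them actually change in a binary task.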


\subsection{MNIST}
The three-layer fully connected architecture used for MNIST is an extension of the standard model used for MNIST in the pruning literature. We varied the number of neurons $P \in \{3, 5, 10, 25, 50, 100, 300, 500, 1000, 5000, 10000\}$ in the first hidden layer. The size of the second hidden layer was kept fixed at 100. For the two-layer FCN, $P \in \{5, 10, 50, 100, 300, 500, 1000, 2000, 5000, 10000\}$ neurons were used in the hidden layer. For the fully connected networks, lottery ticket rewinding was used for networks with $P \geq 1000$. For ResNet-6, we varied the width $W \in \{1, 2, 5, 8, 11, 15, 20, 40, 80, 120\}$ of the convolutional filters to obtain networks of different sizes. Further details can be found in \cref{tab:model_details}.

\subsection{Fashion-MNIST}
We also performed a limited number of experiments on the Fashion-MNIST dataset, focused on finding the effective number of parameters using overparameterised models. Since reproducing the double descent curve was not the target, we only used three-layer FCNs with $P \in \{500, 1000\}$ neurons in the first hidden layer. The size of the second hidden layer was kept fixed at 100. Note that more epochs (320, compared to 120 for MNIST) are needed to train a three-layer FCN on Fashion-MNIST.

\subsection{CIFAR-10}
We primarily considered ResNet-18 for CIFAR-10, where the width $W \in \{2, 5, 8, 11, 15, 20, 40, 60, 80, 100, 120, 150\}$ of the convolutional filters was varied to obtain networks of different sizes. Based on previous observations \citep{frankle2019stabilizing, he2022sparse}, we used 10 epochs of rewinding to consistently find lottery tickets in ResNet-18. To compare the effective number of parameters across different CNN architectures, we also performed our analysis on DenseNet-121 and VGG-16. Further details can be found in \cref{tab:model_details}.

\begin{table}[!ht]
    \centering
    \caption{Neural network architectures used on real data. LR refers to the learning rate.}
    \label{tab:model_details}
    \begin{tabular}{l|lrrrrrr}
        Network & Dataset & Epochs & Batch size & Optimiser & Momentum & LR & Rewind Iter \\ \hline
        Two-layer FCN & MNIST & 120 & 128 & SGD & - & 0.1 & - \\
        Three-layer FCN & MNIST & 120 & 128 & SGD & - & 0.1 & - \\
        ResNet-6 & MNIST & 120 & 128 & SGD & 0.9 & 0.1 & - \\
        Three-layer FCN & Fashion-MNIST & 320 & 128 & SGD & - & 0.1 & - \\
        ResNet-18 & CIFAR-10 & 160 & 128 & SGD & 0.9 & 0.1 & 10 epochs \\
        VGG-16 & CIFAR-10 & 160 & 128 & SGD & 0.9 & 0.1 & 10 epochs \\
        DenseNet-121 & CIFAR-10 & 160 & 128 & SGD & 0.9 & 0.1 & 10 epochs \\
    \end{tabular}
\end{table}

\section{Additional Results}
\Cref{fig:neff_comparison} shows how the effective number of parameters and test error of the best pruned models vary with $\nu$ for the linear and XOR datasets. As expected, the test error decreases as the distance between cluster means $\nu$ increases. For networks trained on the linear dataset, we find that the effective number of parameters, i.e.~the number of parameters in the best pruned models, ranges from $\sim \num{200}$ to $\sim \num{500}$ for different starting models and for different values of $\nu$ (see \cref{fig:linear_neff} for the full distribution). We find the same behaviour for the XOR dataset (\cref{fig:xor_neff}), but with a higher number of parameters (between $\num{250}$ and $\num{1000}$) than for the linear dataset with the same $\nu$.

\begin{figure}[!ht]
\begin{subfigure}{.48\textwidth}
    \centering
    \includegraphics[width=\linewidth]{linear_boxplots.pdf}
    \caption{Linear datasets.}
    \label{fig:linear_neff}
\end{subfigure} \hfill
\begin{subfigure}{.48\textwidth}
    \centering
    \includegraphics[width=\linewidth]{xor_boxplots.pdf}
    \caption{XOR datasets.}
    \label{fig:xor_neff}
\end{subfigure}
    \caption{Distribution of the effective number of parameters of the best pruned models (y-axis) as the distance between the clusters $\nu$ (x-axis) is varied for the linear and XOR datasets. Only pruned models originating from overparameterised full models are considered. The numbers above the boxes report the test error of the model with the median effective number of parameters for each $\nu$.}
    \label{fig:neff_comparison}
\end{figure}

Double descent and sparse double descent curves obtained for two-layer FCNs and ResNet-6 on MNIST can be seen in \cref{fig:mnist_app}. \Cref{tab:summary_mnist} shows the number of parameters and test errors for the full/unpruned models and the corresponding best pruned models. Sparse double descent curves for two different models on Fashion-MNIST in \cref{fig:fmnist} show that, compared to MNIST, the test error for the best pruned models is higher while the effective number of parameters is approximately \num{10000}.

\begin{figure}[!ht]
    \centering
    \includegraphics[width=.48\linewidth]{sdd_mnist_n20.pdf} \hfill
    \includegraphics[width=.48\linewidth]{sdd_mnist_resent_n20.pdf}
    \caption{Pruning models along the double descent curve (dark red) shows that the sparse double descent curves (light red) from different models coincide at their minima. Results are shown for two-layer FCNs (left) and ResNet-6 (right) on MNIST with 20\% label noise averaged over three replicates.}
    \label{fig:mnist_app}
\end{figure}

\begin{figure}[!ht]
    \centering
    \includegraphics[width=.48\linewidth]{sdd_fmnist_n20.pdf}
    \caption{Estimating the effective number of parameters for Fashion-MNIST: Results are shown for three-layer FCNs on Fashion-MNIST with 20\% label noise averaged over three replicates. Compared to MNIST, the test error for the best pruned models is higher while the effective number of parameters is approximately \num{10000}.}
    \label{fig:fmnist}
\end{figure}

\begin{table}[!ht]
    \centering
    \caption{Number of parameters and test error for unpruned and best pruned models for two architectures trained on MNIST: two-layer FCNs and ResNet-6. Average values over 3 replicates are reported. We observe that a $\num{200}\times$ increase in the number of parameters of the full models results in only a $\sim\num{3.5}\times$ increase in the number of parameters of the best pruned models. Notice also that the error achieved by the pruned models appears insensitive to the error rate of the original full model, i.e.~even models with poor generalisation can be rescued by pruning.}
    \label{tab:summary_mnist}
    % \resizebox{\linewidth}{!}{
    \begin{tabular}{rr|rr||rr|rr}
       \multicolumn{4}{c||}{Two-layer FCN} & \multicolumn{4}{c}{ResNet-6} \\ \hline
       \multicolumn{2}{c|}{Parameters} & \multicolumn{2}{c||}{Test error} & \multicolumn{2}{c|}{Parameters} & \multicolumn{2}{c}{Test error} \\ \hline
       Full & Pruned & Full & Pruned & Full & Pruned & Full & Pruned \\ \hline
        \num{39700} & \num{8320} & \num{0.081} & \num{0.053} & \num{19464} & \num{5098} & \num{0.028} & \num{0.012} \\
        \num{79400} & \num{6814} & \num{0.131} & \num{0.077} & \num{36597} & \num{7669} & \num{0.069} & \num{0.012} \\
        \num{238200} & \num{5358} & \num{0.194} & \num{0.092} & \num{67785} & \num{7272} & \num{0.118} & \num{0.014} \\
        \num{794000} & \num{11436} & \num{0.122} & \num{0.106} & \num{120180} & \num{8252} & \num{0.116} & \num{0.013} \\ 
        \num{1588000} & \num{11710} & \num{0.094} & \num{0.095} & \num{478760} & \num{13470} & \num{0.080} & \num{0.013} \\
        \num{3970000} & \num{23428} & \num{0.076} & \num{0.103} & \num{1911120} & \num{17620} & \num{0.061} & \num{0.014} \\
        \num{7940000} & \num{29990} & \num{0.067} & \num{0.097} & \num{4297080} & \num{20286} & \num{0.057} & \num{0.013} \\
    \end{tabular}%}
\end{table}
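
As a sanity check on \cref{tab:summary_mnist}, the ``Full'' column for the two-layer FCN is consistent with a bias-free network with $784$ inputs and $10$ outputs, i.e.~$794 P$ weights for $P$ hidden neurons. This reading, and the hidden widths listed in the code, are inferred from the reported counts rather than stated in the table; the function name \texttt{two\_layer\_fcn\_params} is hypothetical.

\begin{verbatim}
```python
def two_layer_fcn_params(hidden, n_inputs=784, n_outputs=10):
    """Number of weights in a bias-free two-layer fully connected network."""
    return hidden * (n_inputs + n_outputs)


widths = (50, 100, 300, 1000, 2000, 5000, 10000)  # inferred, see lead-in
full = [two_layer_fcn_params(p) for p in widths]
pruned = [8320, 6814, 5358, 11436, 11710, 23428, 29990]  # from the table

print(full[-1] / full[0])                 # 200.0
print(round(pruned[-1] / pruned[0], 2))   # 3.6
```
\end{verbatim}

The two ratios compare the smallest and largest two-layer FCNs in the table: the full model grows by a factor of $200$, while the best pruned model grows by a factor of about $3.6$.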

\begin{table}[!ht]
    \caption{Effective number of parameters and test error of best pruned models on subsets of CIFAR-10 dataset.}
    \label{tab:cifar_sub}
    \centering
    \begin{tabular}{c|rr}
       Classes & Params & Error \\ \hline
        5 & \num{16312} & \num{0.102} \\
        6 & \num{20392} & \num{0.118} \\
        7 & \num{31869} & \num{0.122} \\
        8 & \num{31871} & \num{0.125} \\
        9 & \num{31872} & \num{0.126} \\
        10 & \num{39843} & \num{0.124} \\
    \end{tabular}
\end{table}

Finally, the calibration curves when label noise is added to the test set for MNIST and CIFAR-10 can be seen in \cref{fig:calib_real_noise}. The class-averaged calibration curves show that the pruned models are well-calibrated to noisy data while the full models are overconfident.

\begin{figure*}[!ht]
    \centering
    \begin{subfigure}{.49\linewidth}
        \includegraphics[width=\textwidth]{calib_mnist_h1_1000_noise.pdf}
        \caption{MNIST with two-layer FCN. The ECE values for the pruned and full models are $0.075$ and $0.207$, respectively.} \label{fig:calib_mnist_noise}
    \end{subfigure}\hfill
    \begin{subfigure}{.49\linewidth}
        \includegraphics[width=\textwidth]{calib_cifar10_80_noise.pdf}
        \caption{CIFAR-10 with ResNet-18. The ECE values for the pruned and full models are $0.053$ and $0.254$, respectively.} \label{fig:calib_cifar_noise}
    \end{subfigure}
    \caption{Class-averaged calibration curves for the best pruned and full models on (a) MNIST and (b) CIFAR-10 datasets with noise added to the labels in the test set show that the pruned models are well-calibrated to noisy data while the full models are overconfident. The highlighted areas signify deviation between classes.}
    \label{fig:calib_real_noise}
\end{figure*}
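
For reference, a common way to compute the expected calibration error (ECE) reported in \cref{fig:calib_real_noise} is sketched here. The binning scheme used for the reported values is not specified in this supplement, so the sketch assumes the standard estimator with 10 equal-width confidence bins; the function name \texttt{expected\_calibration\_error} is hypothetical.

\begin{verbatim}
```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and confidence over bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b covers (edges[b], edges[b + 1]]; confidence 0 goes to bin 0.
    bin_ids = np.digitize(confidences, edges, right=True) - 1
    bin_ids = np.clip(bin_ids, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the share of samples
    return ece


# An overconfident toy model: 90% confidence but only 50% accuracy.
conf = [0.9] * 10
hits = [1] * 5 + [0] * 5
print(expected_calibration_error(conf, hits))  # approximately 0.4
```
\end{verbatim}

A lower ECE means that the predicted confidence tracks the observed accuracy more closely, which is the sense in which the pruned models are better calibrated than the full models.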

\bibliography{arora_437-supp}
\end{document}
