% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{fathullah_460} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


% Maths equations
\usepackage{mathtools}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amsthm}
\usepackage{commath}
\usepackage{bm}
\usepackage{physics}
\usepackage{multicol}
\usepackage{bbm}
\usepackage{scalerel}
\usepackage{comment}

% Formatting tables and figures
\usepackage{multirow, booktabs}
\usepackage[bf]{caption}
\setlength{\captionmargin}{10pt}
\usepackage{subcaption}
\usepackage{makecell}
\usepackage{graphicx}
\usepackage{stfloats}

\DeclarePairedDelimiterX{\infdivx}[2]{(}{)}{%
	#1\;\delimsize\|\;#2%
}
% Notation
\newcommand{\GP}{\mathcal{GP}}
\newcommand{\tP}{{\tt P}}
\newcommand{\tQ}{{\tt Q}}
\newcommand{\tp}{{\tt p}}
\newcommand{\tq}{{\tt q}}
\newcommand{\tpost}{{\tt p}(\bm\theta\vert\mathcal{D})}
\newcommand{\tprior}{{\tt p}(\bm\theta)}
\newcommand{\bhy}{\hat{\bm y}}
\newcommand{\hy}{\hat{y}}
\newcommand{\KL}{{\tt KL}\infdivx}
\newcommand{\Softmax}{{\tt Softmax}}
\newcommand{\LSoftmax}{{\tt LogSoftmax}}
\newcommand{\LogSumExp}{{\tt LogSumExp}}
\newcommand{\Dir}{{\tt Dir}}
\newcommand{\GLEU}{{\tt GLEU}}
\newcommand{\AUC}{{\tt AUC}}
\newcommand{\AUCRR}{{\tt AUC_{RR}}}

% Custom commands 
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\minimise}{\,minimise}
\newcommand{\tpm}{\tiny{$\pm$} }

\DeclarePairedDelimiter\ceil{\lceil}{\rceil}
\DeclarePairedDelimiter\floor{\lfloor}{\rfloor}
\providecommand{\tabularnewline}{\\}

\usepackage{pifont}
\newcommand{\cmark}{\ding{51}}%
\newcommand{\xmark}{\ding{55}}%


% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
\usepackage{xr} 
\externaldocument{uai2023-template}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Logit-Based Ensemble Distribution Distillation for \\Robust Autoregressive Sequence Uncertainties\\(Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<yf286@cam.ac.uk>?Subject=L-EDD UAI 2023}{Yassir~Fathullah}{}}
\author[2]{Guoxuan~Xia}
\author[1]{Mark~J.~F.~Gales}
% Add affiliations after the authors
\affil[1]{%
    Engineering Department\\
    University of Cambridge\\
    UK
}
\affil[2]{%
    Department of Electrical \& Electronic Engineering \\
    Imperial College London\\
    UK
}
  
\begin{document}
  
\onecolumn %% Turn this off if single column is desired for the supplement
\maketitle

\appendix

\section{Experimental Configuration}
\label{asec:experimental config}

This section details the datasets used for training, development, evaluation and out-of-distribution (OOD) detection, together with the training setup and hyperparameters used for all models.

\subsection{Datasets}

We use two training sets, En-De WMT16 and En-Ru WMT20, each paired with development and evaluation sets based on newstest13/14 and newstest19/20, respectively. Additionally, we use three out-of-domain (OOD) datasets to evaluate the detection performance of a wide range of transformer models; see Table~\ref{tab:datasets}. As stated previously, all data is cleaned and tokenized using Moses\footnote{\url{github.com/moses-smt/mosesdecoder}}. For WMT16, a shared dictionary is learned using BPE with 32,000 merge operations. For WMT20, we learn disjoint dictionaries using BPE with 40,000 merge operations. A consequence of the larger disjoint dictionaries on WMT20 is a significantly lower fraction of unknown tokens in the OOD datasets.

%
\begin{table*}[h!]
	\centering{}
	\begin{minipage}[t]{1.0\textwidth}%
		\begin{center}
			\caption{Dataset information together with average source and target sentence lengths (in tokens) after tokenization and processing. The OOD test sets Khresmoi, MTNT and KFTT have two quoted numbers for each field as they were processed using either the En-De WMT16 or the En-Ru WMT20 BPE-based dictionaries. Additionally, only source-side information is provided for the OOD sets as these are only used for unsupervised uncertainty estimation.}
			\vspace{-2mm}
			\def\arraystretch{1.08}
			\makebox[\textwidth][c]{
				\begin{tabular}{cc|c|cc|c}
					\toprule
    				\multirow{2}{*}{\textbf{Dataset}} & \multirow{2}{*}{\textbf{Type}} &
    				\textbf{Number of} & \multicolumn{2}{c|}{\textbf{Tokens per Sentence}} & \textbf{Fraction of Unknown} \\
    				& & \textbf{Sentences} & \textbf{Source} & \textbf{Target} & \textbf{Tokens in Source}\\
    				\midrule
    				En-De WMT16 & policy, news, web & 4.5M & 29.5 & 30.6 & 0.01\% \\
    				En-De newstest13 & \multirow{2}{*}{news} & 3.0K & 26.0 & 28.0 & 0.00\% \\
    				En-De newstest14 & & 3.0K & 27.6 & 29.1 & 0.00\% \\
    				\midrule
    				En-Ru WMT20 & policy, news, web & 58.4M & 27.8 & 27.5 & 0.00\% \\
    				En-Ru newstest19 & \multirow{2}{*}{news} & 2.0K & 29.9 & 33.4 & 0.00\% \\
    				En-Ru newstest20 & & 2.0K & 30.9 & 32.5 & 0.00\% \\
    				\midrule
    				Khresmoi & medical & 1.0K & 30.9/30.3 & --- & 0.78\%/0.00\% \\
    				MTNT & noisy reddit & 1.4K & 21.1/21.3 & --- & 0.45\%/0.06\% \\
    				KFTT & encyclopedia & 1.2K & 35.4/35.2 & --- & 1.46\%/0.01\% \\
					\bottomrule
			\end{tabular}}
			\label{tab:datasets}
		\end{center}
	\end{minipage}
\end{table*}
%

\subsection{En-De WMT16 Training}
\label{assec:wmt16}

We use the base transformer of \citet{aiayn} implemented in \texttt{fairseq} \citep{fairseq} and train it on 4 NVIDIA\textcopyright\hspace{0mm} A100 GPUs with an update frequency of 32, i.e., gradients are accumulated over 32 batches per update. This is virtually equivalent to training on $4 \times 32 = 128$ GPUs. Each per-GPU batch contains at most 3584 tokens. Models are optimized with Adam \citep{adam} using $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-8}$. We use a learning rate schedule similar to that of \citet{aiayn}, i.e., the learning rate increases linearly for 4000 warmup steps to a peak value dependent on $d_{\tt model}$, after which it decays proportionally to the inverse square root of the number of steps:
%
\begin{equation*}
    \eta = ({\tt step} \cdot d_{\tt model})^{-0.5} \min \hspace{-0.8mm} \left(1, \frac{{\tt step}}{{\tt warmup}} \right)^{1.5}
\end{equation*}
%
We use label smoothing with a weight of 0.1 on the uniform prior distribution over the vocabulary. The last 10 checkpoints were averaged. Training was stopped after 31 epochs, corresponding to approximately 18 GPU-hours in total. At inference, a beam of 4 with a length penalty of 0.6 is used for all models. The Deep Ensemble consists of 5 such models.
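
As a worked check of the schedule above, it can be written piecewise as
\begin{equation*}
    \eta =
    \begin{cases}
        d_{\tt model}^{-0.5} \cdot {\tt step} \cdot {\tt warmup}^{-1.5}, & {\tt step} \le {\tt warmup}, \\
        (d_{\tt model} \cdot {\tt step})^{-0.5}, & {\tt step} > {\tt warmup}.
    \end{cases}
\end{equation*}
Assuming the base-transformer width $d_{\tt model} = 512$ of \citet{aiayn}, a value that is not restated in this supplement, the peak at ${\tt step} = {\tt warmup} = 4000$ would be
\begin{equation*}
    \eta_{\tt peak} = (512 \cdot 4000)^{-0.5} \approx 6.99 \times 10^{-4},
\end{equation*}
which is consistent with the peak learning rate of $7.0 \times 10^{-4}$ quoted for standard training in the Snapshot Ensemble paragraph below. Similarly, with 4 GPUs, an update frequency of 32 and at most 3584 tokens per GPU batch, each parameter update aggregates at most $4 \times 32 \times 3584 = 458{,}752$ tokens; this upper bound assumes every per-GPU batch is filled to its token limit.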

\textbf{KD of Deep Ensemble}: Knowledge-distilled models are first initialised from one of the teacher members and then trained using the knowledge distillation loss $\mathcal{L}_{\tt KD}$ provided in Section 2.2 with $\lambda = 0.50$. The student was trained with a warmup of 1026 steps (3 epochs) from $\eta = 4.0 \times 10^{-4}$ to $\eta = 7.0 \times 10^{-4}$, after which the learning rate decays, for a total of 24 epochs of training. A temperature of $T = 0.8$ was used in the KL-divergence loss as this was found to be mildly beneficial. All other hyperparameters match the standard case above.

\textbf{Snapshot Ensemble}: The Snapshot Ensemble was generated starting from the last checkpoint of a standard trained transformer. From this point, a cyclic triangular learning rate schedule \citep{cyclic} was employed, oscillating between $\eta_{\tt min} = 1.0 \times 10^{-4}$ and $\eta_{\tt max} = 1.0 \times 10^{-3}$ with a period of 3 epochs. Note that the maximum learning rate in this cyclic phase is notably larger than the peak learning rate ($7.0 \times 10^{-4}$) during standard training. This setting was run for 15 epochs, generating an ensemble with 5 members.
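
For illustration, one symmetric triangular schedule consistent with these values is sketched here. The exact waveform, the starting point of each cycle and the moment at which each snapshot is saved are assumptions, not details stated above. With $t$ the number of epochs elapsed in the cyclic phase and $P = 3$ the period,
\begin{equation*}
    \eta(t) = \eta_{\tt min} + (\eta_{\tt max} - \eta_{\tt min}) \left( 1 - \left| \frac{2 \, (t \bmod P)}{P} - 1 \right| \right),
\end{equation*}
so that $\eta(0) = \eta_{\tt min}$ and $\eta(P/2) = \eta_{\tt max}$. Running this for 15 epochs gives $15 / 3 = 5$ cycles; saving one checkpoint at the end of each cycle, when the learning rate has returned to $\eta_{\tt min}$, would yield the 5 ensemble members.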

\textbf{KD of Snapshot Ensemble}: This system was trained using the same parameters as the students distilled from the Deep Ensemble, but for only 12 epochs since it converged faster.

\textbf{EDD \& L-EDD}: All EDD and L-EDD systems were distribution distilled from the Snapshot Ensemble using the same setup as ``KD of Snapshot Ensemble''. We chose $\beta = 0.10$ by evaluating translation performance over the values $\beta \in \{0.05, 0.10, 0.20, 0.50\}$ on the newstest13 development set; see Section 3.1.



\subsection{En-Ru WMT20 Training}
\label{assec:wmt20}

We use the big transformer of \citet{aiayn}, again implemented in \texttt{fairseq}, and train it on 4 NVIDIA\textcopyright\hspace{0mm} A100 GPUs with an update frequency of 32. Each per-GPU batch contains at most 5120 tokens. Dropout was set to 0.10 and weight decay to 0.0001. The model is trained for 20 epochs, corresponding to 53960 update steps and approximately 230 GPU-hours. The last 5 checkpoints were averaged, which improved performance. At inference, a beam of 5 with a length penalty of 1.0 is used for all models.
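
As a worked check of these figures, 20 epochs over 53960 update steps gives $53960 / 20 = 2698$ updates per epoch. With 4 GPUs, an update frequency of 32 and at most 5120 tokens per GPU batch, each update aggregates at most
\begin{equation*}
    4 \times 32 \times 5120 = 655{,}360
\end{equation*}
tokens. This upper bound assumes every per-GPU batch is filled to its token limit, which need not hold in practice.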

\textbf{Snapshot Ensemble}: Based on the last checkpoint of a standard trained big transformer, a triangular cyclic learning rate is utilised, oscillating between $\eta = 5.0 \times 10^{-5}$ and $\eta = 5.0 \times 10^{-4}$ every 2 epochs for 10 epochs. This results in an ensemble with 5 members. 

\textbf{KD of Snapshot Ensemble}: As in the previous section, the distillation student is initialised from its teacher, but it is trained with a learning rate warmup of 2698 steps (one epoch) from $\eta = 2.0 \times 10^{-4}$ to $\eta = 4.0 \times 10^{-4}$, after which the learning rate decays, for a total of 12 epochs of training. Checkpoints from the last 3 or 5 epochs are averaged, with the number chosen based on performance on the newstest19 development set.

\textbf{L-EDD}: Following distillation, L-EDD (Laplace) models are trained using the same parameters. The best value found in the WMT16 experiments, $\beta = 0.10$, is reused here; no further hyperparameter search is performed at this stage.


\clearpage
\section{Ablation Study: Ensemble Size}

Following the experimental setup in Section 5.1, we perform out-of-distribution detection with both Deep and Snapshot Ensembles of increasing size; see Table~\ref{tab:wmt16-detection-ensemble}. Increasing the size of an ensemble, regardless of its type, does not notably improve its out-of-distribution detection: going from 2 to 10 members changes the AUROC by at most 1.2 points.
%
\begin{table*}[h!]
	\centering{}
	\begin{minipage}[t]{1.0\textwidth}%
		\begin{center}
                \caption{OOD detection performance (\%{AUROC} $\uparrow$) for the base transformer with ID dataset newstest14 and OOD dataset Khresmoi. \textbf{Bold} indicates best in a column, \underline{underline} second best.}
			\vspace{-2mm}
			\def\arraystretch{1.00}
			\small
			\begin{tabular}{c|cc|cc}
				\toprule
                    \textbf{Ensemble} & 
                    \multicolumn{2}{c|}{\textbf{Deep Ensemble}} & 
                    \multicolumn{2}{c}{\textbf{Snapshot Ensemble}} \\
                    \textbf{Size} & \hspace{3mm}\textbf{TU}\hspace{3mm} & \hspace{3mm}\textbf{KU}\hspace{3mm} & \hspace{3mm}\textbf{TU}\hspace{3mm} & \hspace{3mm}\textbf{KU}\hspace{3mm} \\
                    \midrule
                    2 & 48.3 & 61.5 & 48.7 & 61.5 \\
                    3 & 48.2 & 61.7 & 48.8 & 61.9 \\
                    5 & 48.0 & 61.9 & 49.0 & 62.6 \\
                    7 & 48.0 & 61.9 & 49.0 & 62.6 \\
                    10 & 48.0 & 62.7 & 49.1 & 62.6 \\
			\bottomrule
			\end{tabular}
			\label{tab:wmt16-detection-ensemble}
		\end{center}
	\end{minipage}
\end{table*} 


\clearpage
\bibliography{fathullah_460}

\end{document}
