% \twocolumn[

% \aistatstitle{Learning to Rank for Active Learning via Multi-Task \\ Bilevel Optimization  (Supplementary Materials)}

% \thispagestyle{empty}

% \aistatsauthor{ Author 1 \And Author 2 \And  Author 3 }

% \aistatsaddress{ Institution 1 \And  Institution 2 \And Institution 3 } 
% ]


%\title{Title in Title Case\\(Supplementary Material)}
% \title{Supplementary Material}
% \maketitle

\section{Utility Model Architecture and Choice of Classifiers}
\label{UtilityModelArchitecture}
Here, we describe the acquisition network (utility model) discussed in Section~\ref{experimental_setup}.
\subsection{MNIST and FashionMNIST}
\label{FeatureExtractorMNIST&FashionMNIST}
Our architecture operates on pairs of equal-sized subsets of images (utility samples). We use the network below as the feature extractor for the raw image embeddings; it is applied to each subset in the pair as follows:

\begin{enumerate}
    \item 2-D convolution on the set of images.
    \item 2-D max pooling on the output of (1).
    \item ReLU on the output of (2).
    \item 2-D dropout on the output of (3).
    \item 2-D max pooling on the output of (4).
    \item ReLU on the output of (5).
    \item Fully connected layer on the output of (6).
    \item ReLU on the output of (7).
    \item 2-D dropout on the output of (8).
    \item Fully connected layer on the output of (9).
    \item ReLU on the output of (10).
\end{enumerate}


\subsection{CIFAR10 and SVHN}
\label{FeatureExtractorforCIFAR10&SVHN}
We use a ResNet-18 pretrained on ImageNet as the feature extractor and apply the following operations to the extracted features of the images in each subset of the pair:

\begin{enumerate}
    \item Fully connected layer on the set of feature embeddings.
    \item ReLU on the output of (1).
    \item Fully connected layer on the output of (2).
\end{enumerate}
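
The three operations above amount to a two-layer perceptron applied to each feature embedding. As a sketch, with hypothetical weight matrices $W_{1}, W_{2}$ and bias vectors $b_{1}, b_{2}$ (names introduced here only for illustration) and $z$ denoting the ResNet-18 feature embedding of one image, the output can be written as
\[
h(z) = W_{2} \max\left(0,\, W_{1} z + b_{1}\right) + b_{2},
\]
where the maximum is taken elementwise and corresponds to the ReLU in step (2).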


\subsection{Multitask Set-based Neural Networks with RankNet}
\label{multitask_description}

After average pooling the output of (11) for MNIST and FashionMNIST (Section~\ref{FeatureExtractorMNIST&FashionMNIST}) and the output of (3) for CIFAR10 and SVHN (Section~\ref{FeatureExtractorforCIFAR10&SVHN}), we perform the following operations for each subset in the pair:

\begin{enumerate}
    \item Fully connected layer on the extracted features.
    \item ReLU on the output of (1).
    \item Fully connected layer on the output of (2).
\end{enumerate}
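
As a sketch of the set-based computation described above, let $S$ be one subset in the pair and $h(x)$ the feature-extractor output for an image $x \in S$. The symbols $h$, $\bar{h}_{S}$, $\phi_{S}$, $V_{1}$, $V_{2}$, $c_{1}$, and $c_{2}$ are introduced here only for illustration. Average pooling followed by the three operations above gives
\[
\bar{h}_{S} = \frac{1}{|S|} \sum_{x \in S} h(x), \qquad \phi_{S} = V_{2} \max\left(0,\, V_{1} \bar{h}_{S} + c_{1}\right) + c_{2}.
\]
Because the average does not depend on the order of the elements of $S$, the resulting embedding is invariant to permutations of the subset.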

We denote the output of (3) by $\phi_{1}$ and $\phi_{2}$ for the first and second subset in each pair of utility samples, respectively.

To predict the probability that one subset in the pair has a larger utility value than the other, we apply RankNet to $\phi_{1}$ and $\phi_{2}$ for pairwise comparison. The score predicted by RankNet is the final probability that we use to determine whether the first subset has a larger utility value than the second.
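
As an illustrative sketch, assume the standard RankNet formulation, in which a shared scoring function maps each embedding to a scalar score. The symbols $s_{1}$, $s_{2}$, and $P_{12}$ are introduced here for illustration and are not part of the notation above. The predicted probability that the first subset has the larger utility value is then
\[
P_{12} = \frac{1}{1 + \exp\left(-(s_{1} - s_{2})\right)},
\]
so that $P_{12} > 0.5$ exactly when $s_{1} > s_{2}$. Under the same assumption, training minimizes the binary cross-entropy between $P_{12}$ and the observed ordering of the two utility values.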

For the interpolation of the utility value, we use $\phi_{1}$ and $\phi_{2}$ as embeddings and measure the distance between them with the Euclidean distance.
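
Concretely, writing $d$ for this distance (a symbol introduced here for illustration) and assuming that both embeddings lie in $\mathbb{R}^{m}$, we have
\[
d(\phi_{1}, \phi_{2}) = \|\phi_{1} - \phi_{2}\|_{2} = \sqrt{\sum_{j=1}^{m} \left(\phi_{1,j} - \phi_{2,j}\right)^{2}}.
\]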

To predict the optimal transport distance, we apply an MLP projection head to $\phi_{1}$ and $\phi_{2}$:

\begin{enumerate}
    \item Fully connected layer on $\phi_{1}$ and $\phi_{2}$.
    \item ReLU on the outputs of (1).
    \item Fully connected layer on the outputs of (2).
\end{enumerate}

We use the outputs of (3) as a supervision signal in designing the loss function for the utility model (see Definition \ref{otloss} in Section \ref{DUAL_MAX}).

We set $\lambda_{1} = \lambda_{2} = 0.5$ and $\lambda_{3} = 1$.

\subsection{Choice of classifiers}
We do not use ResNet-18 for the MNIST-type datasets because it is likely overkill for MNIST and FashionMNIST: MNIST is a relatively simple dataset of grayscale images of handwritten digits at a resolution of $28 \times 28$ pixels, whereas ResNet-18 is a complex architecture designed for much more challenging image recognition tasks. Moreover, due to its depth, ResNet-18 has far more parameters than simpler models. Training such a large model on 700 MNIST data points can lead to overfitting and poor generalization. Indeed, with 700 labeled MNIST data points, ResNet-18 achieves only 78\% validation set accuracy on average.

\section{Supplemental Experimental Results}
\label{Full_Supplement_Experiment}
\subsection{Additional AL baselines}
\label{rest_baselines}
We also consider non-task-aware and representative deep AL baselines. For all experiments, we include a classical algorithm, Margin Sampling; two recent active learning algorithms, BADGE and CoreSet; one learning-based algorithm, GLISTER; and random selection (Random).
\vspace{-1mm}

\textbf{Margin Sampling} \citep{roth2006margin}: Selects $B$ examples from $\Unlabeled_{0}$ with the smallest difference between the first and second most probable classes predicted by $f$. 
%We apply this baseline as a subroutine in stage 1 to reduce the search space of forward pass in utility model prediction.
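
As a sketch, let $p_{f}(y \mid x)$ denote the class probabilities predicted by $f$, and let $\hat{y}_{1}(x)$ and $\hat{y}_{2}(x)$ be the most and second most probable classes. This notation is introduced here only for illustration. The margin score is
\[
m(x) = p_{f}\left(\hat{y}_{1}(x) \mid x\right) - p_{f}\left(\hat{y}_{2}(x) \mid x\right),
\]
and the $B$ examples in $\Unlabeled_{0}$ with the smallest $m(x)$ are selected. For instance, predicted probabilities $(0.50, 0.45, 0.05)$ give $m(x) = 0.05$, whereas $(0.90, 0.07, 0.03)$ give $m(x) = 0.83$, so the first example is preferred.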

\textbf{BADGE} \citep{ash2019deep}: A hyperparameter-free approach that trades off diversity and uncertainty using $k$-means++ in the hallucinated gradient space.

\textbf{CoreSet} \citep{sener2017active}: A diversity-based approach using greedy approximation to the k-center problem on representations from the current classifier's penultimate layer.   %select $B$ examples by solving the k center on $\{ z_{x}: x \in \Unlabeled_{0} \}$ where $z_{x}$ is the penultimate layer 

\textbf{GLISTER} \citep{killamsetty2021glister}: A learning-based approach that selects the $B$ instances from $\Unlabeled_{0}$ that maximize the log-likelihood on a held-out validation set $\LabeledSet_{val}$, casting the selection as a mixed discrete-continuous bilevel optimization problem. We adopt the GLISTER-ONLINE variant, which approximates the inner optimization problem with a single gradient step.
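
Schematically, and using notation introduced here only for illustration ($LL_{\mathrm{val}}$ and $LL_{\mathrm{train}}$ for the validation and training log-likelihoods, $\theta$ for the classifier parameters, and $\eta$ for a step size), the bilevel problem can be sketched as
\[
S^{\star} \in \arg\max_{S \subseteq \Unlabeled_{0},\; |S| \le B} LL_{\mathrm{val}}\bigl(\theta^{\star}(S)\bigr), \qquad \theta^{\star}(S) \in \arg\max_{\theta} LL_{\mathrm{train}}(\theta; S),
\]
where the single-gradient-step approximation replaces the inner solution by $\theta^{\star}(S) \approx \theta + \eta \nabla_{\theta} LL_{\mathrm{train}}(\theta; S)$. This is a simplified rendering; see \citet{killamsetty2021glister} for the exact formulation.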



\subsection{Validation Set Size vs. Validation Accuracy}
\label{validationsize}
In addition to Figure~\ref{fig:valsetsize}, we report validation set accuracy as a function of the validation set size (averaged across ten trials) in Table~\ref{table:validationsize}. These results demonstrate the robustness of validation accuracy as a consistent measure of utility value, with the standard error (in parentheses) generally decreasing as the validation set size increases. Even at smaller validation set sizes of 50 or 100, which are significantly smaller than the pretraining set size $|\mathcal{S}_0| = k$, the accuracies are comparable to those observed with larger sizes such as 800 or 1000.
\begin{table*}[h!]
\centering
\begin{tabular}{@{}l*{7}{>{\centering\arraybackslash}p{1.5cm}}@{}}
\toprule
\textbf{Dataset \textbackslash{} Validation size} & \textbf{50} & \textbf{100} & \textbf{200} & \textbf{400} & \textbf{600} & \textbf{800} & \textbf{1000} \\ \midrule
\textbf{SVHN} & 0.132(0.018) & 0.132(0.012) & 0.133(0.011) & 0.130(0.010) & 0.136(0.009) & 0.130(0.009) & 0.130(0.008) \\ 
\textbf{MNIST} & 0.758(0.007) & 0.757(0.005) & 0.763(0.001) & 0.764(0.003) & 0.760(0.009) & 0.770(0.009) & 0.764(0.008) \\ 
\textbf{FashionMNIST} & 0.699(0.023) & 0.714(0.015) & 0.707(0.011) & 0.712(0.009) & 0.712(0.009) & 0.707(0.009) & 0.712(0.009) \\ 
\textbf{CIFAR10} & 0.358(0.020) & 0.372(0.016) & 0.370(0.007) & 0.365(0.008) & 0.366(0.005) & 0.368(0.003) & 0.366(0.003) \\ 
\bottomrule
\end{tabular}
\caption{Validation set accuracy across datasets and validation set sizes, averaged across ten trials. Values in parentheses indicate standard error.}
\label{table:validationsize}
\end{table*}
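
For reference, assume that the reported standard error is the sample standard deviation $s$ of the accuracy over the $n = 10$ trials divided by $\sqrt{n}$; the symbols $s$ and $n$ and this convention are assumptions made here for illustration. Then
\[
\mathrm{SE} = \frac{s}{\sqrt{n}}.
\]
Under this convention, the CIFAR10 entry $0.358\,(0.020)$ at validation size 50 corresponds to $s \approx 0.020 \times \sqrt{10} \approx 0.063$.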
\subsection{Additional results on clean data, noisy oracles, and class-imbalance settings}


\begin{figure*}[!hbt]
\centering
% \vspace{-2mm}
% \rotatebox[origin=c]{90}{\quad \quad \scriptsize Cumulative loss}
\begin{subfigure}{.242\textwidth}
        \centering
\includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/FashionMNIST/Performance on FashionMNIST with Pretraining Budget 200 Full.png}
    \caption{\footnotesize FashionMNIST}
    % \vspace{-2mm}
    \label{Pretraining Budget Variation FashionMNIST}
     \end{subfigure} %\quad
     %\\
    \begin{subfigure}{.242\textwidth}
        \centering
        \includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/MNIST/Performance on MNIST with Pretraining Budget 200 Full.png}
        % {\quad \quad \tiny Query cost}
        \caption{MNIST}
        % \vspace{-2mm}
    \end{subfigure}
    %\\
            \begin{subfigure}{.242\textwidth}
        \centering
\includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/CIFAR10/Performance on CIFAR10 with Pretraining Budget 2500 Full.png}
        % {\quad \quad \tiny Query cost}
        \caption{CIFAR10}
        % \vspace{-2mm}
    \end{subfigure}%\hfil
    \quad
    \begin{subfigure}{.242\textwidth}
        \centering
        \includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/SVHN/Performance on SVHN with Pretraining Budget 2500 Full.png}
        % {\quad \quad \tiny Query cost}
        \caption{SVHN}
        % \vspace{-2mm}
    \end{subfigure}%\hfil
    \caption{Clean data setting: Active learning validation performance with $B = 500$ for FashionMNIST and MNIST and $B = 5000$ for CIFAR10 and SVHN. Results are given in \%. The shaded area denotes standard error.} 
\label{Ideal Oracle Rest}
\end{figure*}

\begin{figure*}[!hbt]
\centering
% \vspace{-2mm}
% \rotatebox[origin=c]{90}{\quad \quad \scriptsize Cumulative loss}
\begin{subfigure}{.242\textwidth}
        \centering
\includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{../UAI2024/fig/Appendix/FashionMNIST/Performance on FashionMNIST with Pretraining Budget 200 noisy.png}
    \caption{\footnotesize FashionMNIST}
    % \vspace{-2mm}
    \label{Pretraining Budget Variation FashionMNIST}
     \end{subfigure} %\quad
     %\\
    \begin{subfigure}{.242\textwidth}
        \centering
        \includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{../UAI2024/fig/Appendix/MNIST/Performance on MNIST with Pretraining Budget 200 noisy.png}
        % {\quad \quad \tiny Query cost}
        \caption{MNIST}
        % \vspace{-2mm}
    \end{subfigure}
    %\\
            \begin{subfigure}{.242\textwidth}
        \centering
\includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{../uai2024-cameraready/fig/Appendix/CIFAR10/Performance on CIFAR10 with Pretraining Budget 2500 noisy.png}
        % {\quad \quad \tiny Query cost}
        \caption{CIFAR10}
        % \vspace{-2mm}
    \end{subfigure}%\hfil
    \quad
    \begin{subfigure}{.242\textwidth}
        \centering
        \includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{../UAI2024/fig/Appendix/SVHN/Performance on SVHN with Pretraining Budget 2500 noisy.png}
        % {\quad \quad \tiny Query cost}
        \caption{SVHN}
        % \vspace{-2mm}
    \end{subfigure}%\hfil
    \caption{Noisy oracles setting: Active learning validation performance with $B = 500$ for FashionMNIST and MNIST and $B = 5000$ for CIFAR10 and SVHN. Results are given in \%. The shaded area denotes standard error.} 
\label{Noisy Oracle Rest}
\end{figure*}

\begin{figure*}[!hbt]
\centering
% \vspace{-2mm}
% \rotatebox[origin=c]{90}{\quad \quad \scriptsize Cumulative loss}
\begin{subfigure}{.242\textwidth}
        \centering
\includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{../UAI2024/fig/Appendix/FashionMNIST/Performance on FashionMNIST with Pretraining Budget 200 imbalance.png}
    \caption{\footnotesize FashionMNIST}
    % \vspace{-2mm}
    \label{Pretraining Budget Variation FashionMNIST}
     \end{subfigure}%\quad
     %\\
    \begin{subfigure}{.242\textwidth}
        \centering
        \includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{../UAI2024/fig/Appendix/MNIST/Performance on MNIST with Pretraining Budget 200 imbalance.png}
        % {\quad \quad \tiny Query cost}
        \caption{MNIST}
        % \vspace{-2mm}
    \end{subfigure}
    %\\
        \begin{subfigure}{.242\textwidth}
        \centering
\includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{../uai2024-cameraready/fig/Appendix/CIFAR10/Performance on CIFAR10 with Pretraining Budget 2500 imbalance.png}
        % {\quad \quad \tiny Query cost}
        \caption{CIFAR10}
        % \vspace{-2mm}
    \end{subfigure}%\quad
    \begin{subfigure}{.242\textwidth}
        \centering
\includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{../UAI2024/fig/Appendix/SVHN/Performance on SVHN with Pretraining Budget 2500 imbalance.png}
        % {\quad \quad \tiny Query cost}
        \caption{SVHN}
        % \vspace{-2mm}
    \end{subfigure}%\hfil
    \caption{Class-imbalance setting: Active learning validation performance with $B = 500$ for FashionMNIST and MNIST and $B = 5000$ for CIFAR10 and SVHN. Results are given in \%. The shaded area denotes standard error.}
\label{Class Imbalanced Rest}
\end{figure*}


We follow the settings of the main paper and present additional results that include both the learning-based, one-round AL baselines discussed in the main paper and the deep AL baselines described in Appendix~\ref{rest_baselines}. We cover all scenarios, namely the default clean data setting, the noisy oracle setting, and the class-imbalance setting, for all benchmarks (MNIST, FashionMNIST, CIFAR10, and SVHN) in Figures~\ref{Ideal Oracle Rest}, \ref{Noisy Oracle Rest}, and \ref{Class Imbalanced Rest}, respectively.

\algname outperforms the rest of the baselines in most scenarios. Admittedly, for SVHN, LLAL \citep{yoo2019learning} outperforms all other baselines (including \algname) by a large margin. SVHN is an easy dataset, and here both the initial pool $k$ and the labeling budget $B$ are large. Given $k = 2500$, and as noted by \citet{hacohen2022active}, uncertainty plays a much more significant role than diversity when the labeling budget and the initial labeled set are both large, since training a classifier with a reasonably accurate uncertainty estimate is then feasible. Therefore, the specific design choice of LLAL \citep{yoo2019learning}, namely estimating the cross-entropy loss between pairs of unlabeled instances (another measure of uncertainty, but with \textit{ground-truth} label information incorporated in the loss prediction module), can be expected to yield superior empirical results on SVHN. Yet we emphasize that LLAL is not robust across datasets. For instance, LLAL has mediocre performance in Figure~\ref{fig:imbalance} (a) and (b), well below that of \algname. A related argument is that FashionMNIST and MNIST are easy datasets, for which \citet{hacohen2022active} suggest that the acquisition function should focus on typical, easy, and representative points.

Moreover, compared to \textsc{DULO}, \algname requires far fewer samples for training. Prior works either train millions of datamodels \citep{engstrom2024dsdm} or collect thousands of samples \citep{wang2023one}, which requires considerable time before the deployment or acquisition stage. In contrast, our approach requires only hundreds of utility samples to achieve a fair amount of accuracy improvement. This efficiency is achieved by imposing a strong regularization signal through the OT distance loss and by reducing the regression task to a ranking problem.



\subsection{Size of Pretraining Set}
In the main paper, we focused our evaluation on CIFAR10. Here, we provide experiments to show the effectiveness of \algname on diverse datasets such as MNIST, FashionMNIST, and SVHN for single-round unlabeled data selection. We construct all the pretraining sets by random sampling from the whole training set of each dataset.

Figure~\ref{SizeofSeedSet} illustrates the impact of the pretraining set size on the final validation set accuracy. \algname outperforms the rest of the baselines for most pretraining splits. The only case in which \algname degrades is SVHN with $k = 5500$ and $B = 5000$. One possible explanation is that $k = 5500$ suffices for BADGE to learn a sufficiently accurate gradient embedding space for single-round selection, so BADGE can beat \algname in this setting because the pretraining set is large relative to the acquisition budget. Another interesting observation is that GLISTER often performs worse than most baselines on the three datasets when the pretraining budget is extremely low, i.e., $k = 100$ for FashionMNIST/MNIST and $k = 1500$ for SVHN.
A plausible reason is that a limited pretraining budget, combined with a substantial acquisition budget, exacerbates the bias introduced by the single-step gradient approximation in the inner-level optimization, particularly when maximizing the log-likelihood of the training set.

\subsection{Bilevel Training, OT Distance and RankNet}
For simplicity, the ``$\checkmark$'' for optimal transport denotes $\lambda_{\text{OT}} = 1$, and the ``$\times$'' for RankNet denotes the regression-based utility model described in the main paper.
When ablating other network components, we keep the same feature extractor as in Section~\ref{FeatureExtractorMNIST&FashionMNIST} for MNIST and FashionMNIST and Section~\ref{FeatureExtractorforCIFAR10&SVHN} for CIFAR10 and SVHN. For the regression-style acquisition function, we place an MLP head on the shared representation $\phi$ to predict the validation accuracy, $\hat{u} = g(\phi) = W^{(2)}\sigma(W^{(1)}\phi)$, where $\sigma$ is the ReLU activation function, analogous to the head used for predicting the OT distance in Section~\ref{multitask_description}. For OT distance regularization, we adopt the same MLP projection head architecture described in Section~\ref{multitask_description}.




\begin{figure*}[t]
\centering
% \vspace{-2mm}
% \rotatebox[origin=c]{90}{\quad \quad \scriptsize Cumulative loss}
\begin{subfigure}{.33\textwidth}
        \centering
\includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/Performance on FashionMNIST with acquisition Budget 500.png}
    \caption{\footnotesize FashionMNIST}
    % \vspace{-2mm}
    \label{Pretraining Budget Variation FashionMNIST}
     \end{subfigure}\hfil
     %\\
    \begin{subfigure}{.33\textwidth}
        \centering
        \includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/Performance on MNIST with acquisition Budget 500.png}
        % {\quad \quad \tiny Query cost}
        \caption{MNIST}
        % \vspace{-2mm}
    \end{subfigure}\hfil
    %\\
    \begin{subfigure}{.33\textwidth}
        \centering
        \includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/Performance on SVHN with acquisition Budget 5000.png}
        \caption{SVHN}
    \end{subfigure}%\hfil
    \caption{Effect of the pretraining set size. \textbf{(a-c)} Active learning validation performance with $B = 500$ for FashionMNIST and MNIST and $B = 5000$ for SVHN. Results are given in \%. The shaded area denotes standard error.}
    \label{SizeofSeedSet}
\end{figure*}

\begin{table}[!h]
\centering
\small
\caption{Ablation study on three submodules with pretraining budget $k=200$ and acquisition budget $B=500$ for FashionMNIST. The last row %with each block written as - is 
corresponds to the random baseline.}
% \vspace{-2mm}
\label{BilevelTrainingFashionMNIST}
\scalebox{1.00}{
\begin{tabular}{lccc}
    \toprule
    Bilevel & Optimal Transport & RankNet & Accuracy\\
    \midrule
     $\checkmark$ & $\checkmark$ & $\checkmark$ & $\mathbf{83.1 \pm 0.1}$ \\
    $\checkmark$ & $\checkmark$ & $\times$ & $81.9 \pm 0.2$\\
    $\checkmark$ & $\times$ & $\checkmark$ & $81.2 \pm 0.4$\\
    $\checkmark$ & $\times$ & $\times$ & $81.8 \pm 0.2$ \\
    $\times$ & $\checkmark$ & $\checkmark$ & $81.0 \pm 0.3$\\
     $\times$ & $\checkmark$ & $\times$ & $81.7 \pm 0.2$ \\
    $\times$ & $\times$ & $\checkmark$ & $80.9 \pm 0.3$ \\
    $\times$ & $\times $ & $\times$ & $81.6 \pm 0.1$ \\
    - & - & - & $81.2 \pm 0.2$ \\
    \bottomrule
\end{tabular}}
% \vspace{-3mm}
\end{table}

\begin{table}[!h]
\centering
\small
\caption{Ablation study on three submodules with $k=200$ and $B=500$ for MNIST. The last row %with each block written as - is 
corresponds to the random baseline.}
% \vspace{-2mm}
\label{BilevelTrainingMNIST}
\scalebox{1.00}{
\begin{tabular}{lccc}
    \toprule
    Bilevel & Optimal Transport & RankNet & Accuracy\\
    \midrule
     $\checkmark$ & $\checkmark$ & $\checkmark$ & $\mathbf{95.3 \pm 0.2}$ \\
    $\checkmark$ & $\checkmark$ & $\times$ & $94.9 \pm 0.2$\\
    $\checkmark$ & $\times$ & $\checkmark$ & $95.0 \pm 0.1$\\
    $\checkmark$ & $\times$ & $\times$ & $94.8 \pm 0.2$ \\
    $\times$ & $\checkmark$ & $\checkmark$ & $94.6 \pm 0.1$\\
     $\times$ & $\checkmark$ & $\times$ & $94.9 \pm 0.1$ \\
    $\times$ & $\times$ & $\checkmark$ & $95.0 \pm 0.2$ \\
    $\times$ & $\times $ & $\times$ & $94.8 \pm 0.2$ \\
    - & - & - & $93.4 \pm 0.1$ \\
    \bottomrule
\end{tabular}}
% \vspace{-3mm}
\end{table}

\begin{table}[!h]
\centering
\small
\caption{Ablation study on three submodules with $k=3500$ and $B=5000$ for SVHN. The last row %with each block written as - is 
corresponds to the random baseline.}
% \vspace{-2mm}
\label{BilevelTrainingSVHN}
\scalebox{1.00}{
\begin{tabular}{lccc}
    \toprule
    Bilevel & Optimal Transport & RankNet & Accuracy\\
    \midrule
     $\checkmark$ & $\checkmark$ & $\checkmark$ & $\mathbf{88.1 \pm 0.3}$ \\
    $\checkmark$ & $\checkmark$ & $\times$ & $86.7 \pm 0.2$\\
    $\checkmark$ & $\times$ & $\checkmark$ & $87.8 \pm 0.3$ \\
    $\checkmark$ & $\times$ & $\times$ & $86.5 \pm 0.3$ \\
    $\times$ & $\checkmark$ & $\checkmark$ & $87.8 \pm 0.2$\\
     $\times$ & $\checkmark$ & $\times$ & $86.3 \pm 0.1$ \\
    $\times$ & $\times$ & $\checkmark$ & $87.5 \pm 0.1$ \\
    $\times$ & $\times $ & $\times$ & $86.1 \pm 0.2$ \\
    - & - & - & $86.5 \pm 0.3$ \\
    \bottomrule
\end{tabular}}
% \vspace{-3mm}
\end{table}



\begin{figure*}[!h]
    \begin{subfigure}{.33\textwidth}
        \centering
\includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/performance_vs_lambda_OT_fashion_mnist.png}
    \caption{\footnotesize FashionMNIST}
    % \vspace{-2mm}
    \label{Pretraining Budget Variation FashionMNIST}
     \end{subfigure}%\hfil
     %\\
    \begin{subfigure}{.33\textwidth}
        \centering
        \includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/performance_vs_lambda_OT_mnist.png}
        \caption{MNIST}
    \end{subfigure}\hfil
    %\\
    \begin{subfigure}{.33\textwidth}
        \centering
        \includegraphics[clip,trim=0cm 0cm 0cm 0cm,width=\textwidth]{./fig/Appendix/performance_vs_lambda_OT_svhn.png}
        % {\quad \quad \tiny Query cost}
        \caption{SVHN}
        % \vspace{-2mm}
    \end{subfigure}\hfil
    \caption{Ablation on $\lambda_{\text{OT}}$ across different acquisition budgets. \textbf{(a-b)} $k = 200$; \textbf{(c)} $k = 2500$.}
    \label{hyperparam_ot}
\end{figure*}



To assess the benefit of combining the three seemingly unrelated submodules, we provide ablation studies of the three submodules for the remaining three datasets. Tables~\ref{BilevelTrainingFashionMNIST}, \ref{BilevelTrainingMNIST}, and \ref{BilevelTrainingSVHN} show the impact of turning off each submodule on the final validation set accuracy for FashionMNIST, MNIST, and SVHN, respectively.

Since the main premise is that combining the three submodules improves validation set performance, it is natural to evaluate how much each submodule contributes to utility model training. For FashionMNIST, bilevel training plays a significant role in obtaining good validation set accuracy, as the top three combinations of submodules all have bilevel training turned on (Table~\ref{BilevelTrainingFashionMNIST}). For MNIST, the gains from individual design choices are hard to distinguish, since deterministic CNNs can easily achieve 96\% accuracy with simpler acquisition heuristics such as least confidence or max entropy \citep{gal2017deep}. Still, we observe a marginal improvement with all three submodules turned on (Table~\ref{BilevelTrainingMNIST}), and the top three combinations all have RankNet turned on, which supports the use of pairwise ranking in our utility model. For more complex datasets such as SVHN, RankNet plays a crucial role in improving classification performance, as expected: the top three accuracies all have RankNet turned on (Table~\ref{BilevelTrainingSVHN}).


% replace RankNet in Section~\ref{multitask_description} with fully connected layer on \textbf{SharedRepresentation1} and \textbf{SharedRepresentation2} for pair comparison. The output score predicted by the linear layer  



\begin{table*}[!ht]
\centering
\begin{tabular}{@{}l*{6}{>{\centering\arraybackslash}p{1.5cm}}@{}}
\toprule
\textbf{Labeling Budget \textbackslash{} $\lambda_{\text{OT}}$} & \textbf{0} & $\mathbf{10^{-2}}$ & $\mathbf{10^{-1}}$ & \textbf{1} & \textbf{10} & \textbf{100} \\ \midrule
\textbf{3000} & 0.714(0.002) & 0.718(0.006) & 0.728(0.003) & 0.721(0.002) & 0.724(0.003) & 0.718(0.005) \\ 
\textbf{5000} & 0.752(0.003) & 0.761(0.004) & 0.760(0.003) & 0.772(0.002) & 0.755(0.003) & 0.757(0.004) \\ 
\textbf{7000} & 0.771(0.003) & 0.774(0.004) & 0.779(0.005) & 0.778(0.003) & 0.779(0.003) & 0.788(0.009) \\ 
\textbf{10000} & 0.793(0.003) & 0.800(0.003) & 0.801(0.004) & 0.809(0.002) & 0.803(0.002) & 0.816(0.005) \\ 
\bottomrule
\end{tabular}
\caption{Performance on CIFAR10 for different labeling budgets and values of $\lambda_{\text{OT}}$. Values in parentheses indicate standard error.}
\label{cifar10_lambda_OT}
\end{table*}

\subsection{Hyperparameter Tuning for OT Distance}
\label{HyperparameterTuning_OTDistance}
Figure~\ref{Accuracy Validation Performance}(f) in Section~\ref{Experiments} illustrates the benefits of incorporating the optimal transport distance into the loss of our utility model. Figure~\ref{hyperparam_ot} complements it by showing the usefulness of the optimal transport distance across scales of $\lambda_{\text{OT}}$ on the other datasets of interest. Regardless of the dataset and the classifier architecture, incorporating the optimal transport distance helps reduce generalization error, as measured by the increase in validation set accuracy. Even though $\lambda_{\text{OT}}$ can be a difficult hyperparameter to tune, both Figure~\ref{Accuracy Validation Performance}(f) and Figure~\ref{hyperparam_ot} suggest that the final validation set accuracy for $\lambda_{\text{OT}} \neq 0$ is higher than for $\lambda_{\text{OT}} = 0$.

For additional results on CIFAR10 with $\lambda_{\text{OT}}$ spanning more orders of magnitude, please see Table~\ref{cifar10_lambda_OT}.


\subsection{Runtime Analysis}
All models are trained on NVIDIA A40 GPUs with 48\,GB of memory. To speed up our experiments, we use data parallelism across multiple GPUs. The times reported below are for PyTorch training with 2 GPUs. As stated in the main text, all experiments are repeated for 10 trials to account for training stochasticity. We fix $k = 2500$ and $B = 5000$ for CIFAR10 and SVHN, with $n = 30$ utility samples collected per batch, $\tau_{1} = 2$, $b = 1000$, and $k_{1} = 500$ for the pretraining stage, and each batch trained for 20 epochs. For CIFAR10, we collect 500 pairs of utility samples for offline training of $\hat{u}$, which takes roughly 3 hours and 20 minutes. The total training time for the pretraining and acquisition stages is then 1 hour and 20 minutes (50 minutes for pretraining and 30 minutes for acquisition). For SVHN, we collect 500 pairs of utility samples for offline training, which takes roughly 1 hour and 40 minutes. The total training time for the pretraining and acquisition stages is then roughly 1 hour (29 minutes for pretraining and 34 minutes for acquisition).

We fix $k = 200$ and $B = 500$ for MNIST and FashionMNIST, with $n = 50$ utility samples collected per batch, $\tau_{1} = 3$, $b = 50$, and $k_{1} = 50$ for the pretraining stage, and each batch trained for 20 epochs. For MNIST, we first randomly collect 500 pairs of utility samples; offline training of the utility model $\hat{u}$ takes 50 minutes in total. The total training time for the pretraining and acquisition stages is then 59 minutes (39 minutes for pretraining and 20 minutes for acquisition). For FashionMNIST, offline training of the utility model $\hat{u}$ takes 50 minutes for 500 pairs of utility samples. The total training time for the pretraining and acquisition stages is 50 minutes (32 minutes for pretraining and 18 minutes for acquisition).

For the learning-based acquisition function LLAL \citep{yoo2019learning} with $k = 2500$ and $B = 5000$, training the loss prediction module takes 20 minutes and acquisition takes 50 minutes on CIFAR10; on SVHN, the corresponding times are 5 minutes and 30 minutes. For Margin, GLISTER, Random, BADGE, and CoreSet applied to CIFAR10 and SVHN, training on the pretraining set takes roughly 5 minutes, and the acquisition stage takes 14, 20, 10, 20, and 10 minutes, respectively. For these methods applied to FashionMNIST and MNIST, the training times for the pretraining and acquisition stages are negligible.
