%!TEX root = ./main.tex
% \vspace{-3mm}
\section{Methodology} \label{DUAL_MAX}
We introduce our algorithm, \algname, following the two-stage learning protocol described previously. 
In a nutshell, \algname (1) collects training samples for \textit{pretraining} the utility model, and (2) greedily selects the batch with the maximal estimated utility value, from the first batch to the $t$-th (final) batch, in the acquisition stage. We divide the pretraining stage into $\tau_{1}$ iterations and the acquisition stage into $\tau_{2}$ iterations, with mini-batch size $b$ per iteration. More precisely, 
we instantiate \algname into the following building blocks: 
\begin{enumerate}[label=\alph*)]\denselist
    \item Develop a set-based multitask neural network model $\hat{u}$ as a surrogate model for pretraining; 
    \item Define the loss function for the utility model $\hat{u}$; 
    \item Sample a collection of subsets $\{(\utilitysample, u(\utilitysample))\}_{i} \subseteq \LabeledSet_{0}$ where $i \in [1, \tau_{1}]$ as a growing labeled set up to $\LabeledSet_{0}$ for training $\hat{u}$; 
    \item Update the set-based model $\hat{u}$ per iteration of the pretraining stage; 
    \item Greedily follow the learned utility model $\hat{u}$ in the acquisition stage.
\end{enumerate}

\subsection{A Two-Stage Active Learning Framework}\label{sec:framework}
Having outlined the framework, we now unpack items a)--e) above and discuss each in turn:

\textbf{a) What surrogate models $\hat{u}$ should we use?} \label{surrogatemodel}
Similar to \citet{ilyas2022datamodels}, by parametrizing a surrogate model with training samples, we transform the surrogate model construction into a supervised learning task (see Definition~\ref{UtilityModel}). 
In our context, the training samples $\utilitysample$ are subsets of the pretraining set $\LabeledSet_{0}$ and the utility value is $u(\utilitysample)$. Throughout this work, we refer to the pairs $(\utilitysample, u(\utilitysample))$ as \textit{utility samples}. It is appealing to adopt their linearity assumption in the AL setting because of its strong theoretical footing \citep{saunshi2022understanding} and the simplicity of the resulting model architecture. Nevertheless, to avoid extensive sample collection and model retraining \citep{ilyas2022datamodels}, we prefer more expressive architectures that model the interactions between elements within each utility sample. One natural candidate for $\hat{u}$ is a set-based neural network, owing to its strong expressive power (e.g., Set Transformer \citep{lee2019set} or Deep Sets \citep{zaheer2017deep}). Denote the general set-based neural network (NN) as 
\begin{align}
\label{DeepSets}
    \text{net}(\utilitysample) &=\text{net}(x_{1}, \dots, x_{a}) %\nonumber \\
    %&
    = \rho(\text{pool}(\{\phi(x_{1}), \dots, \phi(x_{a})\})) \nonumber
\end{align}
where $\{ x_{i} \}_{i = 1}^{a}$ represents a single utility sample $\utilitysample$ of size $a$, and $\phi$ and $\rho$ are the feature extractor and the regressor of the set-based NN, respectively.
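
As an illustrative instantiation (an assumption made for exposition, not necessarily the architecture used in the experiments), choosing summation as the pooling operator recovers the Deep Sets form \citep{zaheer2017deep}:
\begin{align}
    \text{net}(\utilitysample) = \rho\Big(\sum_{i=1}^{a} \phi(x_{i})\Big), \nonumber
\end{align}
which is invariant to any permutation of $x_{1}, \dots, x_{a}$ and is well defined for every set size $a$. Both properties matter here, since utility samples are unordered and their size grows over the course of active learning.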

In experiments, we find that a set-based NN can serve as a primitive for utility models, but it still lacks principled supervision signals for model training. \citet{engstrom2024dsdm} train millions of cheap datamodels \citep{ilyas2022datamodels} in the hope of better generalization to unseen tasks, whereas in our setting we cannot afford large-scale training due to its computational cost, and we aim to obtain a good utility model from hundreds of samples for faster deployment. Therefore, we need a more fine-grained signal that ties the labeled data to the validation set.  %\textit{in principle}. 
In particular, \citet{alvarez2020geometric} introduce the notion of geometric distance via optimal transport (OT) between two datasets, and \citet{just2023lava} extend it as a learning-agnostic proxy for measuring model performance on $\LabeledSet_{val}$. The success of the OT distance in predicting validation set accuracy \citep{just2023lava} enables us to cast the ground-truth OT distance between utility samples and the validation set \citep{alvarez2020geometric} as a supervision signal for the utility model.
\begin{definition}[Surrogate Utility Model]\label{UtilityModel}
    Let $\mathcal{X}$ be the instance domain, and $\utilitysample$ be any sampled subset drawn from the distribution $\mathcal{D}$ over $\mathcal{X}$. A \textit{surrogate utility model} %$\hat{u}: 2^\mathcal{X} \rightarrow \mathbb{R}$ 
    $\hat{u} (\utilitysample)$ 
    is a set function mapping $2^{\mathcal{X}}$ to $\mathbb{R}$, 
    %Let $\hat{\theta}_{S'} = \mathcal{A}(S')$ be the model trained on $S'$ using $\mathcal{A}$. 
    % The proposed surrogate utility model is a parametric function $\hat{A}$ 
    optimized to predict the true utility $u(\utilitysample)$ on training subsets $\utilitysample \sim \mathcal{D}$:
    \begin{align} \hat{u}=\argmin_{\tilde{u}_w}\hat{\mathbb{E}}_{\utilitysample \sim \mathcal{D}}[\mathcal{L}(\tilde{u}_w(\utilitysample), u(\utilitysample))]
    \end{align}
where $\mathcal{L}(\cdot, \cdot)$ denotes the loss function, and $\tilde{u}_w$ is a parametric set function to approximate $u$.
\end{definition}
\vspace{-1mm}
\textbf{b) What loss function should we minimize?} 
One natural choice is to minimize the mean squared error (MSE) between the estimated and true utility values, $\mathcal{L} = (\hat{u} - u)^{2}$. Yet the evaluation of validation accuracy is stochastic due to the aleatoric uncertainty of the classifier itself \citep{park2023trak}.
Although the simplest approach is to train a neural network to approximate the utility value in a regression fashion and minimize the MSE, we find that regressing validation accuracy on a set of utility samples fails to yield a good utility model (see Section~\ref{Three Design Choices} for an ablation study on casting the utility model as regression). 
An alternative loss function builds on the idea of pairwise ranking, which reduces the regression problem to a ranking problem. \citet{yoo2019learning} introduce a loss prediction module that predicts the classifier loss on a single data point, and handcraft a loss function that trains this module in a pairwise ranking fashion. For a minibatch of size $d$, \citet{yoo2019learning} divide it into $d/2$ pairs and compare, within each pair, the ordering of the predicted losses against that of the ground-truth losses, thereby discarding the overall loss scale. Extending the idea of ranking classification losses between pairs of instances to ranking utility values, we adopt the classical RankNet \citep{burges2005learning} structure to rank pairs of equal-size utility samples, with the OT distance as a regularizer in the final loss.
\begin{definition}[Ranking Loss]
    Given $\mathcal{X}$, let $\utilitysample_{1}, \utilitysample_{2}$ be two subsets drawn from the distribution $\mathcal{D}$ over $\mathcal{X}$ with equal size $d$. Denote the utility value (validation accuracy) of $\utilitysample_{1}$ by $u_{1}$ and that of $\utilitysample_{2}$ by $u_{2}$. W.l.o.g., suppose $u_{1} > u_{2}$ and let $u_{12} = u_{1} - u_{2}$. Specifically, $\hat{u}(\utilitysample_{1}) > \hat{u}(\utilitysample_{2})$ is taken to mean that the surrogate utility model $\hat{u}$ asserts that $\utilitysample_{1} \rhd \utilitysample_{2}$, i.e., that $\utilitysample_{1}$ is ranked above $\utilitysample_{2}$. Denote the modeled posterior $P(\utilitysample_{1} \rhd \utilitysample_{2})$ by $P_{12}$, and let $\bar{P}_{12}$ be the desired target value for this posterior. The Binary Cross Entropy (BCE) loss for the pair $(\utilitysample_{1}, \utilitysample_{2})$ is written as %\yuxin{removed bottom half} 
    \begin{align}
\mathcal{L}_{\text{Rank}} %&= \mathcal{L}({u_{12}}) \nonumber \\&
        = - \bar{P}_{12}\log P_{12} - (1 - \bar{P}_{12}) \log(1 - P_{12}).\nonumber
    \end{align}
\end{definition}
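
As a sketch of how the posterior $P_{12}$ can be obtained, assume the standard logistic parametrization of RankNet \citep{burges2005learning}; the definition above does not fix this choice, so it is an assumption here:
\begin{align}
    P_{12} = \frac{1}{1 + \exp\big(-(\hat{u}(\utilitysample_{1}) - \hat{u}(\utilitysample_{2}))\big)}. \nonumber
\end{align}
With the hard target $\bar{P}_{12} = 1$ (since $u_{1} > u_{2}$), the ranking loss reduces to
\begin{align}
    \mathcal{L}_{\text{Rank}} = \log\Big(1 + \exp\big(-(\hat{u}(\utilitysample_{1}) - \hat{u}(\utilitysample_{2}))\big)\Big). \nonumber
\end{align}
For example, an uninformative model with $\hat{u}(\utilitysample_{1}) = \hat{u}(\utilitysample_{2})$ incurs $\mathcal{L}_{\text{Rank}} = \log 2 \approx 0.693$, whereas a correctly ordered pair with $\hat{u}(\utilitysample_{1}) - \hat{u}(\utilitysample_{2}) = 2$ incurs $\log(1 + e^{-2}) \approx 0.127$. The loss depends only on the difference of the predicted utilities, not on their absolute scale.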

With the OT distance in hand, we guide $\hat{u}$ to learn a principled signal that is tied to validation set accuracy and to ignore the shifting distribution between the labeled data and $\LabeledSet_{val}$ in the acquisition stage (see Definition~\ref{otloss}). Even though the OT distance can be approximated in near-linear time \citep{altschuler2017near}, our goal is to estimate which subset of training data yields the highest validation accuracy rather than to approximate the OT distance itself. To circumvent this computational cost, we use the OT distance as a supervision signal that regularizes $\hat{u}$ rather than as an input to $\hat{u}$. We show the efficacy of incorporating the OT Distance Loss in Section~\ref{Three Design Choices}.
% \vspace{-1mm}
\begin{definition}[OT Distance Loss]
\label{otloss}
    Given two utility samples $\utilitysample_{1}$, $\utilitysample_{2}$, denote the corresponding ground-truth OT distance values by $OT_{1}$, $OT_{2}$ and the predicted OT distance values by $\hat{OT}_{1}$, $\hat{OT}_{2}$. The OT distance loss is defined as 
    \begin{align}
\mathcal{L}_{\text{OT}} = & \lambda_{1} (\hat{OT}_{1} - OT_{1})^{2} + \lambda_{2} (\hat{OT}_{2} - OT_{2})^{2} \nonumber \\
    & - \lambda_{3} (\min(\hat{OT}_{1}, 0) + \min(\hat{OT}_{2}, 0)) \nonumber
\end{align}
where $\lambda_{1}, \lambda_{2}, \lambda_{3}$ are hyperparameters. 
\end{definition} 
Here, the first two terms are squared errors on the predicted OT distances and the third term is a positivity constraint that penalizes negative predictions. Intuitively, the OT distance loss specifies the penalty for mispredicting the OT distance values of utility samples $\utilitysample_{1}$ and $\utilitysample_{2}$. Combining the ranking loss and the OT distance loss, we obtain the loss function for RAMBO over pairs of utility samples:
% \vspace{-2mm}
\begin{definition}[Total Loss for Utility Model]
\label{total_loss}
Given two utility samples $\utilitysample_{1}$, $\utilitysample_{2}$, the total loss over $(\utilitysample_{1},\utilitysample_{2})$ is defined as 
    \begin{align}
        \mathcal{L}_{\text{Total}} =\mathcal{L}_{\text{Rank}} + \lambda_{\text{OT}} \cdot \mathcal{L}_{\text{OT}}
    \end{align}
where $\lambda_{\text{OT}}$ is a hyperparameter. %  for tuning.
\end{definition}
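
To illustrate how the terms combine, consider a worked example in which all numbers are hypothetical and chosen only for ease of arithmetic. Let $\lambda_{1} = \lambda_{2} = \lambda_{3} = 1$, $OT_{1} = 2.0$, $\hat{OT}_{1} = 1.5$, $OT_{2} = 1.0$, and $\hat{OT}_{2} = -0.2$. Then
\begin{align}
    \mathcal{L}_{\text{OT}} &= (1.5 - 2.0)^{2} + (-0.2 - 1.0)^{2} - \big(\min(1.5, 0) + \min(-0.2, 0)\big) \nonumber \\
    &= 0.25 + 1.44 + 0.2 = 1.89, \nonumber
\end{align}
where the last summand is the penalty incurred by the negative prediction $\hat{OT}_{2}$. With $\mathcal{L}_{\text{Rank}} = 0.5$ and $\lambda_{\text{OT}} = 0.1$, the total loss is $\mathcal{L}_{\text{Total}} = 0.5 + 0.1 \cdot 1.89 = 0.689$.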
% \vspace{-2mm}
\textbf{c) How do we collect utility samples iteratively?}
The very first question encountered during pretraining is how to generate utility samples. \citet{ilyas2022datamodels} construct training subsets by randomly sampling fixed-size subsets. 
One caveat in our setting is that the labeled set grows as the active learner progresses. To enable the model to adapt to the growing length of utility samples, one needs to incorporate \textit{diversity} in the size of $\utilitysample$. One natural choice is to perform rejection sampling from the \textit{powerset} of $\LabeledSet_{0}$, i.e., $\utilitysample \sim 2^{\LabeledSet_{0}}$. Instead of fixing the sampling proportion, 
we propose to fix the number of utility samples collected from $\LabeledSet_{0}$ per iteration during pretraining as $n$.

\textbf{d) How do we update the set-based NN during pretraining?}
As mentioned in Section~\ref{surrogatemodel}, the length of labeled utility samples grows, and a random split into training and validation sets may fail to capture the notion of generalizability in neural batch active learning. The goal of the utility model is to \textit{generalize} to longer utility samples and to learn a general mapping from a utility sample to its validation accuracy. Inspired by work on bilevel training \citep{franceschi2018bilevel, grazzi2020iteration, borsos2021semi}, we employ a bilevel framework that separates the utility samples by \textit{length}. In practice, we split the utility samples evenly ($50\%$/$50\%$) into a training set and a validation set for simplicity. We retrain the set-based NN at every iteration on the utility samples accumulated so far. We defer the complete discussion of bilevel training to Section~\ref{bilevel_opt}.

\textbf{e) How do we acquire data in the acquisition stage?} 
In the context of utility maximization, perhaps the simplest candidate is to select the instance with the largest predicted utility. Popular approaches sequentially pick one data point per round \citep{houlsby2011bayesian, gal2017deep}, even though adding a single data point changes validation accuracy only minimally while increasing the cost of model retraining. \citet{alieva2020learning} suggest that for many sequential decision-making problems, greedy heuristics for sequentially selecting actions exhibit superior performance 
without invoking expensive evaluation oracles. Recall that one can interpret $\hat{u}$ as a score-based acquisition function and leverage it for sequential decision making, i.e., to greedily select the unlabeled data with the highest predicted utility. Inspired by \citet{citovsky2021batch}, we employ Margin Sampling \citep{roth2006margin} as a filter for unlabeled instances, i.e., we select the $M$ unlabeled instances with the lowest margin scores at each iteration of the acquisition stage (see Algorithm~\ref{alg:Greedy-Margin}). We propose to randomly split $\Unlabeled_{0}$ into batches of size $b$, concatenate each batch to the current labeled pool, and then use the concatenated set as input to $\hat{u}$ for %forward pass 
utility prediction. We perform sequential batch selection within the acquisition stage and, at each iteration, select the unlabeled batch with the largest predicted score.
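
The selection step described above can be sketched as follows, where the candidate batches $R_{1}, \dots, R_{m}$ are notation introduced here for illustration. At acquisition iteration $j$, with current labeled pool $S_{j}$, the chosen batch is
\begin{align}
    b_{\max} = \argmax_{R_{l} \in \{R_{1}, \dots, R_{m}\}} \hat{u}(S_{j} \cup R_{l}), \qquad S_{j+1} = S_{j} \cup b_{\max}. \nonumber
\end{align}
As a hypothetical example, filtering $M = 1000$ instances by margin and using batch size $b = 100$ gives $m = \lfloor M / b \rfloor = 10$ candidate batches, so scoring the candidates at each iteration takes $10$ forward passes of $\hat{u}$.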
% \vspace{-1mm}
\subsection{The RAMBO Algorithm}
% \vspace{-2mm}
The essence of our two-stage utility model aligns with Shakespeare's famous line from \textit{The Tempest}, ``What's past is prologue.'' 
Our overarching motivation is to train an acquisition function on past utility samples that generalizes well to utility samples of longer history. 
We initialize the utility model by collecting utility samples from offline datasets and training on them, which provides an initial estimate of the \textit{feature extractor} $\phi_{0}$. 
This initial feature extractor $\phi_{0}(\cdot)$ can serve as a warm start for non-adaptive batch selection in the acquisition stage. 
We emphasize the need for this \textit{initialization} step as \algname is designed for single-round acquisition. 
% \vspace{-1mm}
\begin{algorithm}[t]
\caption{\algname}
\begin{algorithmic}[1]
\State {\bf Input}: 
%Budget 
$B$, 
%Unlabeled set 
$\Unlabeled_{0}$, 
% Pretrained utility model $\hat{u}_{0}$, 
%Initial labeled set with size $k$ examples as 
%Pretraining Set 
$\LabeledSet_{0}$, 
%with size $k$, 
%Ground set 
$\mathcal{X}$, %$V = \LabeledSet_{0} \cup \Unlabeled_{0}$.
%Non-adaptive batch $b$ per iteration,
%mini-batch size 
$b$, 
%margin size 
$M$, 
$n$, %utility samples per iteration, 
% classifier $f$, 
%collection of utility samples $D_{0}$, 
$\LabeledSet_{val}$.
\State {\bf Output}: $\LabeledSet_{1}$
%\State Randomly sample $\utilitysample \subseteq \mathcal{X} \setminus (\LabeledSet_{0} \cup \Unlabeled_{0}) $ and obtain utility samples ${(\utilitysample, u(\utilitysample))}$. %${(\utilitysample, u(\utilitysample))}$ from $\mathcal{X} \setminus (\LabeledSet_{0} \cup \Unlabeled_{0})$ 
%\yuxinil{notations on labeled/unlabeled subsets still problematic}
\State Initialize $(\hat{u}_{0}, \phi_{0})$ from offline dataset %\leftarrow$ %\text{Pretrain} \{(\utilitysample, u(\utilitysample))\}$. 
\State Randomly divide $\LabeledSet_{0}$ with size $k$ into $S_{0}$ with size $k_{1}$ and $\{s_{1}, s_{2}, \dots, s_{\tau_{1}}\}$, each of size  
$b$, and set $U_{0} = \Unlabeled_{0}$
%$\lfloor \frac{k-k_{1}}{b} \rfloor$
\State $\tau_{1} = \frac{k-k_{1}}{b}$ and $\tau_{2} = \frac{B}{b}$
\State Train $f$ on $S_{0}$ and get accuracy on $\LabeledSet_{val}$ as $acc_{0}$
\State $D_{0} \leftarrow \emptyset$
%\State Train classifier $f$ on $s_{seed}$ and obtain validation set $\LabeledSet_{val}$ as $acc_{0}$
\For{$i = 0, \dots, \tau_{1} - 1$} 
\Comment{\textbf{Pretraining}}
    \State $S_{i+1} \leftarrow S_{i} \cup s_{i+1}$
    \State Train $f$ on $S_{i+1}$
    \State Obtain accuracy on $\LabeledSet_{val}$ as $acc_{i+1}$
    \State $D_{i+1} \leftarrow$ Utility-Samples-Augmentation($S_{i},$ \\
$S_{i+1}, n, acc_{i}, acc_{i+1}, D_{i}$) 
    \State Train $\hat{u}_{i+1}$ on $D_{i+1}$ %\leftarrow \hat{u}_{i-1}(D_{i+1})$
    \Comment{\textbf{Bilevel Optimization}}
\EndFor
\For {$j = 0, \dots, \tau_{2} - 1$} \Comment{\textbf{Acquisition}}
    \State $S_{j+1}, U_{j+1} \leftarrow \text{Greedy-Margin}(\hat{u}_{\tau_{1}}, j, b, S_{j}, M, U_{j})$
    % \State $S_{t} \leftarrow \argmax_{S \in \Unlabeled_{t}} \hat{u}(s | S_{t - 1})
\EndFor
\State $\LabeledSet_{1}, \Unlabeled_{1} = S_{\tau_{2}}, U_{\tau_{2}}$ 
% \State $\Unlabeled_{\tau_{2}} = \Unlabeled_{1}$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[t]
    \caption{Greedy-Margin}
\label{alg:Greedy-Margin}
    \begin{algorithmic}[1]
        \State {\bf Input}: $\hat{u}$, $j$, $b$, $S_{j}$, $M$, $U_{j}$.
        \State {\bf Output}: $S_{j+1}$, $U_{j + 1}$
        %\State {\bf Output}: A set of data points $S_{j} \subseteq U_{j}$ satisfying $|S_{j}| \leq b$
        % \For{($t \leftarrow \tau; t \leq T, t \leftarrow i + 1$)}
            \State $R \leftarrow$ the $M$ examples in $U_{j} \setminus S_{j}$ with the smallest margin scores
            \State Randomly divide $R$ into $m = \lfloor |R| / b \rfloor$ batches $R_{1}, \dots, R_{m}$, each of size $b$
            \State $b_{\max} \leftarrow \argmax_{R_{l} \in \{R_{1}, \dots, R_{m}\}} \hat{u}(S_{j} \cup R_{l})$
            \State $S_{j + 1} \leftarrow S_{j} \cup b_{\max}$
            \State $U_{j+1} \leftarrow U_{j} \setminus b_{\max}$
        % \EndFor
    \end{algorithmic}
\end{algorithm}
\begin{algorithm}[t]
    \caption{Utility-Samples-Augmentation}
    \label{alg:interpolate}
    \begin{algorithmic}[1]
        \State {\bf Input}: $S_i$, $S_{i+1}$, $n$, $acc_{i}$, $acc_{i+1}$, $D_{i}$.
        % $S_{t-1}, S_{t}, u(S_{t-1}), u(S_{t})$, number of utility samples to collect per round as $n$.
        \State {\bf Output}: $D_{i + 1}$
        \For{$j \in \text{range}(n)$}
        \State Randomly sample a pair $(\utilitysample_{1}, \utilitysample_{2})$ of equal-size subsets from $S_{i}$
        \State Compute the distance between $\phi(\utilitysample_{1})$ and $\phi(S_{i})$ as $d_{1,i}$ and the distance between $\phi(\utilitysample_{1})$ and $\phi(S_{i+1})$ as $d_{1, i+1}$. The same rule applies to $\utilitysample_{2}$ to obtain $d_{2,i}$ and $d_{2,i+1}$.
        \State Calculate $u_{1}$, $u_{2}$ for $\utilitysample_{1}$ and $\utilitysample_{2}$ by Equation~\ref{interpolation}
        %= (d_{1,i} \cdot acc_{i-1} + d_{1,i-1} \cdot acc_{i})/(d_{1,i} + d_{1,i-1})$
        %\State Calculate $u_{2}$ for $\utilitysample_{2}$ by Equation~\ref{augmentation} %= (d_{2,i} \cdot acc_{i-1} + d_{2,i-1} \cdot acc_{i})/(d_{2,i} + d_{2,i-1})$
        \State $D_{i} \leftarrow D_{i} \cup \{ (\utilitysample_{1}, u_{1}), (\utilitysample_{2}, u_{2}) \}$
        \EndFor
    \State $D_{i + 1} \leftarrow D_{i}$
    \end{algorithmic}
\end{algorithm}
\subsubsection{Bi-Level Optimization}
% \vspace{-2mm}
\label{bilevel_opt}
To align with the growing labeled pool in the AL setting, a core requirement of our utility model is the capability to \textit{generalize to longer and unseen data} by drawing on prior utility samples.
A line of research \citep{rajeswaran2019meta, liu2019self} suggests that meta-learning can lead to fast adaptation and generalization to new tasks. One formulation of meta-learning is bilevel optimization \citep{maclaurin2015gradient}, where the inner objective represents the adaptation to a given task and the outer problem is the meta-training objective.
Motivated by \citet{franceschi2018bilevel}, we formulate utility model training as bilevel optimization, a framework that unifies gradient-based hyperparameter optimization and meta-learning and in which the outer optimization problem is solved subject to the optimality of an inner optimization problem. To improve generalization to samples of varied lengths, we divide the utility samples $(\utilitysample, u(\utilitysample))$ at iteration $i$ into a training set $D_{\text{tr}}$ and a validation set $D_{\text{val}}$ by length, where $D_{\text{tr}}$ contains the utility samples whose length is below the median and $D_{\text{val}}$ contains the remaining, longer ones, and use them as the input datasets for the \textit{inner objective} $\mathcal{L}$ and the \textit{outer objective} $E$, respectively. We consider the bilevel optimization framework 
\begin{align}
\min_{\lambda}~ E(w(\lambda), \lambda) %\nonumber \\
%\text{subject to } 
\text{~~s.t. ~} w(\lambda) = \argmin_{\hat{w} \in \mathbb{R}^{d}} 
    \mathcal{L}(\hat{w}) \nonumber
    %l(\hat{w}, \lambda) 
\end{align}
where $\lambda$ is a hyperparameter, $E$ and $\mathcal{L}$ are continuously differentiable functions, and the outer objective is
\begin{align}
    E(w(\lambda), \lambda) := \sum\limits_{ \{(S_{1}', u(S_{1}')), (S_{2}', u(S_{2}'))\} \in D_{\text{val}}} \mathcal{L}_{\text{Total}}(w(\lambda)) %l(\hat{u}(w), u)
    \nonumber
\end{align}
and the inner objective is
\begin{align}
     \begin{split}
     \mathcal{L}(\hat{w}) = \sum\limits_{\{(S_{1}', u(S_{1}')), (S_{2}', u(S_{2}'))\} \in D_{\text{tr}}} \mathcal{L}_{\text{Total}}(\hat{w}) + \Omega_{\lambda}(\hat{w}) \nonumber
\end{split}
\end{align}
where $D_{\text{tr}} = \{(\utilitysample_{1}, u(\utilitysample_{1})), (\utilitysample_{2}, u(\utilitysample_{2})) \}_{i=1}^{n}$ is the set of pairs of utility samples assigned to the training set,  $\mathcal{L}_{\text{Total}}(\cdot)$ is the %BCE loss induced by the supervised algorithm 
loss function specified in \defref{total_loss}, and $\Omega_{\lambda}$ is a regularizer parametrized by $\lambda$. The outer objective serves as a proxy for the generalization error of $\hat{u}(\cdot)$, given by the loss accumulated over $D_{\text{val}}$.

The inner optimization is aimed at \emph{utility model optimization}, i.e., finding the parameters that minimize the loss on the shorter training samples $D_\text{tr}$. Conversely, the outer optimization aims to \emph{generalize the model to longer utility samples} $D_\text{val}$ by seeking the optimal regularizer parametrized by $\lambda$. With the bilevel formulation, RAMBO shows better and more stable performance when performing unlabeled data selection on CIFAR10 with a labeling budget of 5000 (as suggested by Table~\ref{BilevelTraining1}).
Table~\ref{BilevelTraining1} shows the average performance of models trained with bilevel optimization, which mostly outperform their counterparts trained without it, illustrating the enhanced generalizability across various model architectures and training algorithms.
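
To make the formulation concrete, consider the following sketch; the specific regularizer and the numbers are illustrative assumptions rather than the configuration used in the experiments. Suppose the utility samples collected so far have lengths $\{100, 200, 300, 400\}$, so that the median length is $250$: the samples of length $100$ and $200$ form $D_{\text{tr}}$, and those of length $300$ and $400$ form $D_{\text{val}}$. With a weight-decay regularizer $\Omega_{\lambda}(\hat{w}) = \frac{\lambda}{2} \| \hat{w} \|_{2}^{2}$, the inner problem fits $\hat{w}$ on the shorter samples for a fixed $\lambda$, and the outer problem adjusts $\lambda$ using the hypergradient (assuming $w(\lambda)$ is differentiable)
\begin{align}
    \frac{d E}{d \lambda} = \frac{\partial E}{\partial \lambda} + \Big( \frac{d w(\lambda)}{d \lambda} \Big)^{\top} \frac{\partial E}{\partial w}, \nonumber
\end{align}
where the first term vanishes in this example because the outer objective depends on $\lambda$ only through $w(\lambda)$.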
% \vspace{-1mm}
\subsubsection{Interpolation-Based Utility Samples}\label{interpolation based surrogates}
% \vspace{-1mm}
In contrast to the thousands or even millions of training samples used in the datamodels framework \citep{ilyas2022datamodels,engstrom2024dsdm}, the scarcity of utility samples poses a challenge to training our utility model effectively. We resort to consistency regularization techniques from semi-supervised learning to generate artificial utility samples $(\utilitysample, u(\utilitysample))$. 
Following \citet{parvaneh2022active}, we posit that the latent space of the classifier's feature extractor contains valuable representations that can be interpolated between labeled instances. This empirical success suggests a change in perspective: rather than tweaking the classifier, we leverage the shared representations in $\hat{u}$ throughout the course of optimization. In particular, we adopt the \textit{interpolation consistency regularization} strategy \citep{verma2022interpolation} (Definition~\ref{def:interp}). The pseudocode for utility-sample augmentation is outlined in Algorithm~\ref{alg:interpolate}.  
\begin{definition}[Utility Value Interpolation]
Denote the validation accuracy at iteration $i$ as $acc_{i}$. 
For a given utility sample $\utilitysample_{1}$, let $d_{1,i}$ be its distance\footnote{The OT distance is computed using the latent-space representation of the utility model.} to the previous labeled pool $S_i$ and $d_{1,i+1}$ its distance to the current labeled pool $S_{i+1}$. The augmented utility value $u_{1}$ for $\utilitysample_{1}$ is given by
\begin{align}
\label{interpolation}
    u_{1} &= \alpha \cdot acc_{i} + (1 - \alpha) \cdot acc_{i+1}
    % \vspace{-2mm}
\end{align}
with $\alpha := \frac{d_{1,i+1}}{d_{1,i+1} + d_{1,i}}.$\label{def:interp}
% \vspace{-4mm}
\end{definition}
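
As a worked example with hypothetical values, read the two endpoints of Equation~\ref{interpolation} as the validation accuracies $acc_{i}$ and $acc_{i+1}$, and suppose $d_{1,i} = 1$, $d_{1,i+1} = 3$, $acc_{i} = 0.60$, and $acc_{i+1} = 0.68$. Then
\begin{align}
    \alpha = \frac{3}{3 + 1} = 0.75, \qquad u_{1} = 0.75 \cdot 0.60 + 0.25 \cdot 0.68 = 0.62. \nonumber
\end{align}
Because $\utilitysample_{1}$ lies closer to $S_{i}$ than to $S_{i+1}$, its augmented utility is pulled toward $acc_{i}$; a sample equidistant from both pools ($\alpha = 0.5$) would instead receive the average, $0.64$.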
