%\subsection{Overview} 
%Figure~\ref{fig:online} is a pictorial description of our two-phase data construction flow. 
%%
%It consists of the initial phase which attempts to construct a reasonably modest model. 
%The iterative phase involves incremental human-annotation in which the Online update operation plays an important role to minimize the necessary human intervention. 
%
%\subsubsection{Initialization}
%
%In this first phase, we create simultaneously the very first training testsets. 
%%
%Each has a formally defined number of images with annotations from medical experts, i.e., batch data. 
%%
%These two datasets are used to train and evaluate the based model $\theta_{base}$.
%% 
%In case the evaluation metric $f_1$ does not match an expected value $f_1^{base}
%$, another batch of data will be labeled and added into the training set. 
%%
%This round of iteration will be terminated when the $f_1$ score achieves as good as $f_1^{base}$.
%%
%At this point, the training set contains $n_{init}$ batches and the testset has one batch.
%
%\subsubsection{Online Human-in-the-loop with Momentum}:
%Having the reasonably modest base model $\theta_{base}$, we then select the rest of unlabeled dataset, combined with the training set ($n_{init}$ batches) and the test set to perform the inference. 
%%
%For the unlabeled set, at a certain iteration $t$, the inference $\hat{p}_t$ is weighted averaging with the previous $\hat{p}_{t-1}$ and the current estimation ${p}_t$, which is formally defined as %$\hat{p}_t = \mathcal{F}_\mu (\hat{p}_{t-1}, {p}_{t})$, in which $\mathcal{F}_\mu$ is an online update operator with momentum $\mu$ (will be explained in subsection~\ref{ssec:momentum}). 
%%
%The results of inferences $\hat{p}_t$ potentially contains samples with high confident scores $\hat{p}_t^H$ and low confident scores $\hat{p}_t^L$. 
%%
%The subset of samples with high confident scores $\hat{p}_t^H$ is then hard-thresholded by a value $\tau_t$ to produce $\hat{p}_t^{H, thresh}$ and combined with relabeling annotation from doctors on %low confident scores $\hat{p}_t^{L, relabel}$
%%
%On the other hand, the current training set and test set being re-evaluated are also based on the threshold $\tau_t$ to extract \textbf{outliers} which also need relabeling from doctors. 
%% 
%Modified triplet of datasets (updated unlabeled, training and test set) after doctor intervention are then used to train a new snapshot of the model $\theta_{t}$ for some iteration $t$. 
%%
%Terminal condition is trigger if the evaluation metric satisfied, i.e., $f_1 > f_1^{target}$
%%
%In the next online active learning iteration, the training set is \textbf{corrected} itself by relabeling on associative outliers, and is \textbf{expanded} by an amount of new labeling on low %confident scores $\hat{p}_t^{L, relabel}$ of unlabeled dataset. 
%%
%Our argument is that with this approach, the additionally number of labeling samples on unlabeled data plus the number of relabeling on training and test sets, after a long run, is minimal in terms of %cost. 
%
%
%\begin{figure}[h]
%\caption{}
%\label{fig:online}
%\end{figure}


\subsection{Decision Boundary}

%In binary classification task, a deep CNN uses the distance from the decision boundary, parameterized by learned weight vector, and feature vector extracted from an image to make a decision of whether %the image has a predefined label or not.
%
%\begin{equation}
%    \begin{aligned}
%        p(c=1|x) &= \sigma(w^{T}f(x,\theta))\\
%        p(c=0|x) &= 1 - p(c=1|x)
%    \end{aligned}
%\end{equation}
%
%\noindent
%where $f(x,\theta)$ is the feature vector, $w$ is the learned weight vector and the sigmoid function $\sigma$.
%
A typical approach to AL~\cite{ceal} is to compute an uncertainty score using the entropy measure

\begin{equation}
    \mathcal{U}(x) = \sum_{c=0,1} -p(c|x)\log(p(c|x))
\end{equation}
\noindent
followed by assigning labels to low-uncertainty instances and randomly sampling high-uncertainty instances for human annotation.
%
This approach is equivalent to assigning labels to high-confidence instances\footnote{Instances with high $p(c=1|x)$ or $p(c=0|x)$.} and sending low-confidence instances for human annotation.
%
However, high-confidence instances yield only a small information gain, since their feature vectors lie far from the decision boundary on either side (Fig.~\ref{fig:decision_boundary}).
%
In contrast, instances whose feature vectors lie in the neighborhood of the decision boundary are the ones the model is least certain how to label, and thus they are the most informative for further learning.
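%
As a worked illustration, assuming the natural logarithm in $\mathcal{U}$ (the base is not stated explicitly), a confident prediction with $p(c=1|x) = 0.9$ and a maximally ambiguous one with $p(c=1|x) = 0.5$ give
\[
    \mathcal{U}_{0.9} = -0.9\ln 0.9 - 0.1\ln 0.1 \approx 0.325,
    \qquad
    \mathcal{U}_{0.5} = -2 \cdot 0.5\ln 0.5 = \ln 2 \approx 0.693,
\]
where the subscripts are shorthand, introduced only for this example, for the value of $p(c=1|x)$.
%
The entropy is largest at $p(c=1|x) = 0.5$, i.e., exactly on the decision boundary, and decreases as the prediction moves toward either class.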


\subsection{Gist Data Point}
Furthermore, our aim is to reduce the amount of data that needs to be annotated.
%
In other words, we want to select only those data points that represent the global structure of the decision boundary neighborhood.
%
However, due to the complexity of a deep CNN, the global structure of the resulting feature space is not well understood.
%
Nevertheless, deep CNNs have been shown to be good feature extractors that map an image onto a high-dimensional sphere~\cite{face-net}.
%
Based on that, we hypothesize that the feature space of a deep CNN can be treated as a manifold embedded in a flat Euclidean space\footnote{We assume an embedding rather than an immersion because we want visually different images to have well-separated feature vectors.}.
%
Under this hypothesis, we first build a local neighborhood around each data point near the decision boundary using the method from~\cite{dbscan_eps}; such local neighborhoods are considerably more robust than the global structure of the feature space.
%
We then define gist points as those data points that have at least a minimum number of data points inside their neighborhood~\cite{dbscan_eps}.
%
Finally, we sample from the set of gist points to reduce the amount of data to be annotated.
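%
One way to write the gist-point criterion formally is sketched below.
%
The radius $\varepsilon$, the count threshold $m$, the candidate set $\mathcal{B}$ of instances near the decision boundary, the feature map $f(\cdot,\theta)$, and the choice of Euclidean distance are notational assumptions introduced here for illustration; the exact construction follows~\cite{dbscan_eps}.
\[
    N_\varepsilon(x_i) = \{\, x_j \in \mathcal{B} : \| f(x_j,\theta) - f(x_i,\theta) \|_2 \le \varepsilon \,\},
    \qquad
    x_i \ \mbox{is a gist point} \iff |N_\varepsilon(x_i)| \ge m .
\]
Under this reading, a gist point is a point that sits in a sufficiently dense region of the boundary neighborhood, so a small sample of gist points can stand in for the many instances around them.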
%
%Furthermore, as shown in~\cite{data_distillation}, using multiple view of input data, such as: flip, crop and zoom, results in a more robust estimation of the confidence score. 
%
%Therefore, we adopt the use of adding test time augmentations in all of our experiments.

%\subsection{Human in the Loop}
%
%For human annotator, it's easy to make mistake after working with thousands of data instances.
%
%Such mistake will make the training set noisy which, in turn, makes training harder to converge.
%
%To reduce human mistake, we adopt the approach in \cite{cleanlab} to detect all possible noisy labeled instances in the training set to give them to the annotator for label reevaluation.

\begin{figure}[t]
\centering
\includegraphics[width=0.9\linewidth, scale=0.25]{fig/border.png}
\caption{High-confidence samples are far from the decision boundary, so they are not as informative as those near the boundary.}
\label{fig:decision_boundary}
\end{figure}

\subsection{Online Learning with Momentum}\label{ssec:momentum}
%We define high confidence data instances as those with confidence score in the range of $[0, 0.1] \cup [0.9, 1]$, such definition correspond to picking instances with $\mathcal{U}$ less than $0.325$.
%
%Active learning method like CEAL~\cite{ceal} uses current model to assign label to high confidence intances.
%
At the beginning of AL, the labeled data is inherently noisy and the model may not yet have learned enough features to generate consistent labels; e.g., high-confidence instances in one iteration may become low-confidence instances in the next.
%
Therefore, we use a running average to stabilize the output of the model:
%
\begin{equation}
    \label{eq:momentum}
    \hat{p}_t = \mu\hat{p}_{t-1} + (1-\mu)p_t
\end{equation}
\noindent
where $\hat{p}_t$ is the confidence score after applying momentum, $p_t$ is the original confidence score of the model in iteration $t$, and $\mu$ controls how much the past score affects the final score.
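%
Unrolling the recursion shows how quickly past predictions are forgotten.
%
Assuming some initial value $\hat{p}_0$ (its choice is an assumption of this illustration and is not specified here),
\[
    \hat{p}_t = \mu^{t}\hat{p}_0 + (1-\mu)\sum_{k=0}^{t-1}\mu^{k}\,p_{t-k},
\]
so the prediction made $k$ iterations ago carries weight $(1-\mu)\mu^{k}$; for example, with $\mu = 0.5$ the weights are $0.5, 0.25, 0.125, \dots$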
%
For $\mu \in [0, 0.4]$, $\hat{p}_{t-1}$ has little effect on the final confidence score; therefore, instances with small fluctuations in $p_t$ are removed from the pseudo-labeling process.
%
Such aggressive removal is unwarranted, since instances with $p_t \in [0.8, 0.9)$ but $\hat{p}_{t-1} \geq 0.9$ can still be treated as high-confidence data points.
%

%
In contrast, for $\mu \in [0.6, 1]$, instances with $p_t < 0.6$ may end up with $\hat{p}_t \geq 0.9$, which can destabilize training because unstable instances are kept in the training set.
%
Therefore, we set $\mu = 0.5$ in our experiments.
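%
To make this trade-off concrete, Eq.~(\ref{eq:momentum}) can be solved for the smallest current score that keeps an instance above a confidence threshold.
%
Assuming a threshold of $0.9$ (the value used in the discussion above) and the most favorable history $\hat{p}_{t-1} = 1$, for $\mu < 1$ we have
\[
    \hat{p}_t \geq 0.9 \iff p_t \geq \frac{0.9 - \mu}{1 - \mu},
\]
which evaluates to $0.875$ for $\mu = 0.2$, $0.8$ for $\mu = 0.5$, and $0.5$ for $\mu = 0.8$.
%
Under these assumptions, a small $\mu$ drops an instance after a minor dip in $p_t$, a large $\mu$ keeps an instance whose current score is no better than chance, and $\mu = 0.5$ tolerates a current score down to $0.8$.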

%\subsection{Core Set Selection}

%
%As shown in Fig.~\ref{fig:pca}, core points cover almost all the region of the non-discriminative set.

\begin{figure}
    \centering
    \includegraphics[width=0.7\linewidth, scale=0.25]{fig/pca.png}
    \caption{Lung Lesion CXR instances with $\hat{p}_t \in [0.4, 0.6]$. Blue dots represent normal data points; orange dots represent gist points, which cover almost all instances.}
    \label{fig:pca}
\end{figure}

%\begin{figure}
%    \centering
%    \includegraphics[width=\linewidth]{fig/eps.png}
%    \caption{todo: placeholder}
%    \label{fig:eps}
%\end{figure}

%\begin{figure*}
%    \centering
%    \subfloat[][]
%    {
%        \includegraphics[width=\linewidth]{fig/mu1.png}
%    }\\
%    \subfloat[][]
%    {
%        \includegraphics[width=\linewidth]{fig/mu2.png}
%    }\\
%    \subfloat[][]
%    {
%        \includegraphics[width=\linewidth]{fig/mu3.png}
%    }
%    \caption{Contour plot of final confident score for different momentum values. For $\mu$ lies in the range $[0, 0.4]$, the effect of current model prediction is too prominent. For $\mu$ lies in range [0.6, 1], the %final confidence score is too lenient for current model prediction, since it allows current low confident instances to have high final confident score. For $\mu$ at value of $0.5$, the final prediction is balanced %from current and past predictions since it allows the previously high confidence instances to have at most $0.1$ reduction in current prediction}
%    \label{fig:mu}
%\end{figure*}
