\subsection{Implementation Details}
For the network architecture, we use a pre-trained ResNeXt50-32x4d~\cite{resnext} as our backbone.
%
We use SGD with Nesterov momentum~\cite{sgd-nesterov} as the optimizer and train the network for $8$ epochs.
%
The learning rate and its schedule are selected following the one-cycle policy~\cite{super-convergence} for fast convergence.
%
Since medical data is inherently unbalanced, we follow the weighting scheme of~\cite{cb-loss} to make training more stable.

%
As the feature vector for DBSCAN, we use the output vector of the trained backbone.
%
We follow~\cite{dbscan_eps} to pick the neighborhood radius of each data instance.
%
For the minimum number of data points of gist instances, we find that taking the inflection point from the previous step and dividing it by 10 works well (see~Fig.~\ref{fig:pca}).

\subsection{Datasets}
%
%
We study the effectiveness of AL in three cases: \textbf{1.} when unlabeled data is abundant; \textbf{2.} when unlabeled data is scarce; \textbf{3.} when the labels are extracted from medical reports by NLP.

%
\textit{For the 1st case}, we use our private datasets of AO and LL.
%
The train set consists of $131,030$ data points with $68,959$ positive AO instances and $12,848$ positive LL instances.
%
Each instance was annotated by two or three radiologists, chosen from a pool of more than 30 radiologists with at least five years of experience.
%
The test set consists of $4,279$ instances. Each instance is annotated by a group of five to eight radiologists from the same pool. It contains $1,789$ positive AO instances and $857$ positive LL instances.

%
\textit{For the 2nd case}, we use the public RSNA Pneumonia dataset\footnote{https://www.kaggle.com/c/rsna-pneumonia-detection-challenge}.
%
It consists of 26,684 data points, which we split in a stratified manner with a ratio of 9:1.
%
The final training set consists of 24,015 data points with 5,411 positive instances, and the final test set consists of 2,669 data points with 601 positive instances.
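%
As a quick sanity check of the stratified split (a worked calculation from the counts above, not an additional result), the two subsets add up to the full dataset and keep nearly the same positive rate:
\[
24{,}015 + 2{,}669 = 26{,}684, \qquad \frac{5{,}411}{24{,}015} \approx 22.5\%, \qquad \frac{601}{2{,}669} \approx 22.5\%.
\]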

%
\textit{For the 3rd case}, we pick PE, the finding with the most positive samples, from the CheXpert dataset to study the difference between NLP labels and pseudo labels.
%
For PE, we use the U-Ones~\cite{irvin2019chexpert} approach to assign positive labels to uncertain instances.
%
We then test the models on the same test set as the 1st case.
%
The test set consists of $4,279$ instances with $865$ positive samples.
%
Each instance was also annotated by a group of five to eight radiologists.



\begin{figure}[h]
    \floatconts
        {fig:pathology}
        {\caption{Performance gains of AL methods on the private Airspace Opacity dataset, the private Lung Lesion dataset, and the public Pneumonia dataset. Compared to other methods, GOAL achieves the largest performance gain per annotated instance.}}
        {
            \subfigure[Private Airspace Opacity Dataset]{
                \label{fig:airspace_data}
                \includegraphics[width=0.45\textwidth]{fig/airspace.png}
            }
            \subfigure[Private Lung Lesion Dataset]{
                \label{fig:lung_lesion_data}
                \includegraphics[width=0.45\textwidth]{fig/lung_lesion.png}
            } \\       
            \subfigure[Public Pneumonia Dataset]{
                \label{fig:pneumonia_data}
                \includegraphics[width=0.45\textwidth]{fig/pneumonia.png}
            }
            \subfigure[Public Pleural Effusion Dataset]{
                \label{fig:chexpert_data}
                \includegraphics[width=0.45\textwidth]{fig/pleural_effusion.PNG}
            }
        }
\end{figure}
\subsection{Analysis}
%
We retrospectively study how much data a model trained on AL-acquired data needs to reach the same performance as a model trained on the full data.
%
We construct several AL pipelines to compare with GOAL. The details are as follows.

%
\noindent
\textbf{Baseline}: We uniformly sample 10K instances for annotation at each iteration.

%
\noindent
\textbf{CEAL}: We follow the authors' process to assign pseudo labels to extremely high-confidence instances, $p_t \in [0, 0.001] \cup [0.999, 1]$.
%
From the lower-confidence instances in the complementary interval, 10K are sampled uniformly for annotation.
%

\noindent
\textbf{Naive}: We expand CEAL's pseudo-label interval to include more high-confidence instances (i.e., $p_t \in [0, 0.1] \cup [0.9, 1]$).

\noindent
\textbf{Naive+}: Based on Naive, we further refine the sampling interval for data annotation to the most informative interval $p_t \in [0.4, 0.5]$.

\noindent
\textbf{Momentum}: We refine the Naive+ method by replacing the model's output probability with a running average over the previous AL iterations.
%
The running average is calculated using Eq.~\ref{eq:momentum}.

%
\noindent
\textbf{GOAL}: Finally, we reduce the total amount of annotated data of the Momentum method by only selecting representative data points.
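
\noindent
The pseudo-labeling intervals of CEAL and Naive can be summarized in a single rule. This is only a sketch: the pseudo label $\hat{y}_t$ and the threshold $\tau$ are symbols introduced here for illustration, and it assumes that low-probability instances receive the negative label and high-probability instances the positive label:
\[
\hat{y}_t =
\left\{
\begin{array}{ll}
0, & p_t \in [0, \tau], \\
1, & p_t \in [1 - \tau, 1], \\
\mbox{unassigned}, & \mbox{otherwise},
\end{array}
\right.
\]
with $\tau = 0.001$ for CEAL and $\tau = 0.1$ for Naive.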

%
For all datasets, we sample an initial train set of $6,550$ instances from the original train set.
%
For each learning iteration, we sample a maximum of $6,500$ instances from the remaining train pool to add to the current train set\footnote{Each AL approach samples instances from the full training set differently, but all of them use the same initial set of training data.}.
%
As shown in Table~\ref{tab:f1}, for the AO dataset, the performance of the AL methods improves as we refine CEAL into GOAL. The final AO F1 score increases by $2.3\%$ from CEAL to GOAL, while the amount of annotated data drops from $24.84\%$ to $17.13\%$ (Table~\ref{tab:data}). The GOAL method achieves the best performance gain per annotated instance, as shown in Fig.~\ref{fig:airspace_data}.
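%
The data percentages above can be traced back to the sampling budget. As a worked check (assuming that iteration 1 corresponds to the initial train set and that each of the four subsequent iterations uses the full budget of $6,500$ instances), a method that always annotates the maximum reaches
\[
6{,}550 + 4 \times 6{,}500 = 32{,}550, \qquad \frac{32{,}550}{131{,}030} \approx 24.84\%,
\]
whereas GOAL annotates $22,447$ instances in total (Table~\ref{tab:data}), i.e., $22{,}447 / 131{,}030 \approx 17.13\%$ of the train set.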

\begin{table}[]
\caption{F1 scores for Airspace Opacity (AO), Lung Lesion (LL), Pneumonia (PN), and Pleural Effusion (PE) from the CheXpert dataset. We ran 10 iterations to investigate the effect of NLP and pseudo labels.}
\label{tab:f1}
% \vspace{4mm}
\centering

\resizebox{1.03\textwidth}{!}{%
\hspace{-10mm}
\begin{tabular}{|l|c|ccccc|c|cccccccccc|}
\hline
\multicolumn{1}{|c|}{\multirow{2}{*}{Method}} &
  \multirow{2}{*}{Finding} &
  \multicolumn{5}{c|}{Active Learning Iteration} &
  \multirow{2}{*}{Finding} &
  \multicolumn{10}{c|}{Active Learning Iteration} \\ \cline{3-7} \cline{9-18} 
\multicolumn{1}{|c|}{} &
   &
  1 &
  2 &
  3 &
  4 &
  5 &
   &
  1 &
  2 &
  3 &
  4 &
  5 &
  6 &
  7 &
  \multicolumn{1}{l}{8} &
  \multicolumn{1}{l}{9} &
  \multicolumn{1}{l|}{10} \\ \hline
Full &
  \multirow{7}{*}{AO} &
  \multicolumn{5}{c|}{\textbf{0.871}} &
  \multirow{7}{*}{PN} &
  \multicolumn{10}{c|}{\textbf{0.631}} \\ \cline{3-7} \cline{9-18} 
Baseline &
   &
  \multicolumn{1}{c|}{\multirow{6}{*}{0.768}} &
  0.845 &
  0.856 &
  0.853 &
  0.857 &
   &
  \multicolumn{1}{c|}{\multirow{6}{*}{0.525}} &
  0.604 &
  0.599 &
  0.605 &
  0.607 &
  - &
  - &
  - &
  - &
  - \\
CEAL &
   &
  \multicolumn{1}{c|}{} &
  0.855 &
  0.836 &
  0.857 &
  0.848 &
   &
  \multicolumn{1}{c|}{} &
  0.567 &
  0.589 &
  0.613 &
  0.625 &
  - &
  - &
  - &
  - &
  - \\
Naive &
   &
  \multicolumn{1}{c|}{} &
  0.844 &
  0.855 &
  0.854 &
  0.861 &
   &
  \multicolumn{1}{c|}{} &
  0.574 &
  0.592 &
  0.612 &
  0.626 &
  - &
  - &
  - &
  - &
  - \\
Naive+ &
   &
  \multicolumn{1}{c|}{} &
  0.825 &
  0.845 &
  0.845 &
  0.845 &
   &
  \multicolumn{1}{c|}{} &
  0.556 &
  0.585 &
  0.616 &
  0.620 &
  - &
  - &
  - &
  - &
  - \\
Momentum &
   &
  \multicolumn{1}{c|}{} &
  \textbf{0.855} &
  0.862 &
  0.865 &
  \textbf{0.871} &
   &
  \multicolumn{1}{c|}{} &
  0.576 &
  0.608 &
  0.614 &
  0.623 &
  - &
  - &
  - &
  - &
  - \\
GOAL &
   &
  \multicolumn{1}{c|}{} &
  0.854 &
  \textbf{0.867} &
  \textbf{0.870} &
  \textbf{0.871} &
   &
  \multicolumn{1}{c|}{} &
  \textbf{0.576} &
  \textbf{0.607} &
  \textbf{0.619} &
  0.623 &
  - &
  - &
  - &
  - &
  - \\ \hline
Full &
  \multirow{7}{*}{LL} &
  \multicolumn{5}{c|}{0.743} &
  \multirow{7}{*}{PE} &
  \multicolumn{10}{c|}{0.785} \\ \cline{3-7} \cline{9-18} 
Baseline &
   &
  \multicolumn{1}{c|}{\multirow{6}{*}{0.688}} &
  0.698 &
  0.708 &
  0.719 &
  0.728 &
   &
  \multicolumn{1}{l|}{\multirow{6}{*}{0.755}} &
  \multicolumn{1}{l}{\textbf{0.764}} &
  \multicolumn{1}{l}{\textbf{0.755}} &
  \multicolumn{1}{l}{0.762} &
  \multicolumn{1}{l}{0.762} &
  \multicolumn{1}{l}{0.726} &
  \multicolumn{1}{l}{0.776} &
  \multicolumn{1}{l}{\textbf{0.786}} &
  \multicolumn{1}{l}{0.689} &
  \multicolumn{1}{l|}{0.731} \\
CEAL &
   &
  \multicolumn{1}{c|}{} &
  0.706 &
  0.719 &
  0.720 &
  0.737 &
   &
  \multicolumn{1}{l|}{} &
  \multicolumn{1}{l}{0.726} &
  \multicolumn{1}{l}{0.745} &
  \multicolumn{1}{l}{0.753} &
  \multicolumn{1}{l}{0.766} &
  \multicolumn{1}{l}{\textbf{0.770}} &
  \multicolumn{1}{l}{0.778} &
  \multicolumn{1}{l}{0.775} &
  \multicolumn{1}{l}{0.769} &
  \multicolumn{1}{l|}{\textbf{0.777}} \\
Naive &
   &
  \multicolumn{1}{c|}{} &
  0.705 &
  0.719 &
  0.733 &
  0.741 &
   &
  \multicolumn{1}{l|}{} &
  \multicolumn{1}{l}{0.731} &
  \multicolumn{1}{l}{0.727} &
  \multicolumn{1}{l}{\textbf{0.779}} &
  \multicolumn{1}{l}{0.758} &
  \multicolumn{1}{l}{0.692} &
  \multicolumn{1}{l}{0.760} &
  \multicolumn{1}{l}{0.768} &
  \multicolumn{1}{l}{0.763} &
  \multicolumn{1}{l|}{0.766} \\
Naive+ &
   &
  \multicolumn{1}{c|}{} &
  0.700 &
  0.706 &
  0.711 &
  0.735 &
   &
  \multicolumn{1}{l|}{} &
  \multicolumn{1}{l}{0.757} &
  \multicolumn{1}{l}{0.776} &
  \multicolumn{1}{l}{0.770} &
  \multicolumn{1}{l}{0.763} &
  \multicolumn{1}{l}{0.762} &
  \multicolumn{1}{l}{0.763} &
  \multicolumn{1}{l}{0.784} &
  \multicolumn{1}{l}{0.763} &
  \multicolumn{1}{l|}{0.769} \\
Momentum &
   &
  \multicolumn{1}{c|}{} &
  0.722 &
  \textbf{0.742} &
  \textbf{0.742} &
  0.749 &
   &
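  \multicolumn{1}{l|}{} &
  \multicolumn{1}{l}{0.749} &
  \multicolumn{1}{l}{0.744} &
  \multicolumn{1}{l}{0.732} &
  \multicolumn{1}{l}{0.758} &
  \multicolumn{1}{l}{0.750} &
  \multicolumn{1}{l}{0.780} &
  \multicolumn{1}{l}{0.782} &
  \multicolumn{1}{l}{0.775} &
  \multicolumn{1}{l|}{0.772} \\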
GOAL &
   &
  \multicolumn{1}{c|}{} &
  \textbf{0.730} &
  0.735 &
  0.741 &
  \textbf{0.753} &
   &
  \multicolumn{1}{l|}{} &
  \multicolumn{1}{l}{0.750} &
  \multicolumn{1}{l}{0.743} &
  \multicolumn{1}{l}{0.733} &
  \multicolumn{1}{l}{0.757} &
  \multicolumn{1}{l}{0.748} &
  \multicolumn{1}{l}{\textbf{0.783}} &
  \multicolumn{1}{l}{0.780} &
  \multicolumn{1}{l}{\textbf{0.776}} &
  \multicolumn{1}{l|}{0.775} \\ \hline
\end{tabular}%
}
\end{table}

%
%
%


\begin{table}[]
\caption{Number of annotated instances for each method. The lowest amounts of annotated data are in bold.}
\label{tab:data}
\vspace{5mm}
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{|l|c|rrrr|c|rrrr|}
\hline
Method &
  Finding &
  \multicolumn{1}{c}{\#Neg.} &
  \multicolumn{1}{c}{\#Pos.} &
  \multicolumn{1}{c}{Total} &
  \multicolumn{1}{c|}{\%} &
  Finding &
  \multicolumn{1}{c}{\#Neg.} &
  \multicolumn{1}{c}{\#Pos.} &
  \multicolumn{1}{c}{Total} &
  \multicolumn{1}{c|}{\%} \\ \hline
Full &
  \multirow{7}{*}{AO} &
  68,959 &
  62,071 &
  131,030 &
  100.00 &
  \multirow{7}{*}{PN} &
  18,604 &
  5,411 &
  24,015 &
  100.00 \\
Baseline &
   &
  17,199 &
  15,351 &
  32,550 &
  24.84 &
   &
  11,005 &
  3,393 &
  14,398 &
  59.95 \\
CEAL &
   &
  20,135 &
  12,415 &
  32,550 &
  24.84 &
   &
  8,996 &
  2,671 &
  11,667 &
  48.58 \\
Naive &
   &
  18,342 &
  14,208 &
  32,550 &
  24.84 &
   &
  9,036 &
  2,785 &
  11,821 &
  49.22 \\
Naive+ &
   &
  \textbf{11,678} &
  \textbf{9,912} &
  \textbf{21,590} &
  \textbf{16.48} &
   &
  7,337 &
  \textbf{2,425} &
  9,762 &
  40.65 \\
Momentum &
   &
  13,799 &
  12,642 &
  26,441 &
  20.18 &
   &
  6,579 &
  2,528 &
  9,107 &
  37.92 \\
GOAL &
   &
  11,844 &
  10,603 &
  22,447 &
  17.13 &
   &
  \textbf{5,770} &
  2,426 &
  \textbf{8,196} &
  \textbf{34.12} \\ \hline
Full &
  \multirow{7}{*}{LL} &
  118,182 &
  12,848 &
  131,030 &
  100.00 &
  \multirow{7}{*}{PE} &
  104,550 &
  86,477 &
  191,027 &
  100.00 \\
Baseline &
   &
  26,946 &
  5,604 &
  32,550 &
  24.84 &
   &
  55,093 &
  45,951 &
  101,044 &
  52.90 \\
CEAL &
   &
  27,699 &
  4,851 &
  32,550 &
  24.84 &
   &
  76,030 &
  68,301 &
  144,331 &
  75.56 \\
Naive &
   &
  26,843 &
  5,707 &
  32,550 &
  24.84 &
   &
  55,234 &
  45,810 &
  101,044 &
  52.90 \\
Naive+ &
   &
  13,118 &
  \textbf{4,271} &
  17,389 &
  13.27 &
   &
  75,441 &
  62,621 &
  138,062 &
  72.27 \\
Momentum &
   &
  13,466 &
  5,224 &
  18,690 &
  14.26 &
   &
  45,309 &
  32,273 &
  77,582 &
  40.61 \\
GOAL &
   &
  \textbf{11,477} &
  4,848 &
  \textbf{16,325} &
  \textbf{12.46} &
   &
  \textbf{36,792} &
  \textbf{27,912} &
  \textbf{64,704} &
  \textbf{33.87} \\ \hline
\end{tabular}%
}
\end{table}

%
For the LL dataset, the F1 score dips by $0.6\%$ when going from the Naive to the Naive+ method. We hypothesize that sampling from the uncertainty region for an unbalanced class drastically changes the previously assigned pseudo labels.
%
Consequently, when we use momentum to stabilize the pseudo labels, the final F1 score increases markedly from $0.735$ to $0.749$.
%
The final GOAL method achieves the best performance per annotated instance, as shown in Fig.~\ref{fig:lung_lesion_data}.

%
On the PN dataset, all AL approaches achieve comparable F1 scores; we hypothesize that this is due to a lack of pseudo-labeled instances.
%
Despite that, GOAL uses only $34.12\%$ of the total data, while CEAL needs $48.58\%$ (Table~\ref{tab:data}), to achieve the same performance.

%
We also study the effect of active learning methods on CheXpert, a dataset with NLP-generated annotations.
%
Fig.~\ref{fig:chexpert_data} shows that the CEAL, Momentum, and GOAL methods perform more stably than the other methods.
%
We hypothesize that this stability comes from consistently using a small number of extremely high-confidence instances in CEAL and stable-confidence instances in Momentum and GOAL.
%
Furthermore, we attribute the unstable performance of all methods to using only NLP-generated annotations.
%
Therefore, manual annotation is required if active learning methods are to be applied in the medical domain.

%TB: I find difficult to distungush between CEAL and GOAL (colors are similar), you can change one to dotted line




% Please add the following required packages to your document preamble:
% \usepackage{multirow}
% \begin{table}[]
% \footnotesize
% \caption{Confusion matrices between CEAL and GOAL pseudo label at each iteration. We only take the confusion matrix from iteration 2 since momentum starts after that.}
% \centering
% \begin{tabular}{|c|c|c|cc|cc|cc|cc|}
% \cline{1-11}
% \multicolumn{3}{|l|}{\multirow{3}{*}{Confusion Matrices}} &
%   \multicolumn{8}{c|}{GOAL} \\ \cline{4-11} 
% \multicolumn{3}{|l|}{} &
%   \multicolumn{2}{c|}{Iteration 2} &
%   \multicolumn{2}{c|}{Iteration 3} &
%   \multicolumn{2}{c|}{Iteration 4} &
%   \multicolumn{2}{c|}{Iteration 5} \\ \cline{4-11} 
% \multicolumn{3}{|l|}{} &
%   0 &
%   1 &
%   0 &
%   1 &
%   0 &
%   1 &
%   0 &
%   1 \\ \hline
% \multirow{2}{*}{AO} &
%   \multirow{4}{*}{CEAL} &
%   0 &
%   \multicolumn{1}{c|}{5} &
%   0 &
%   \multicolumn{1}{c|}{24} &
%   0 &
%   \multicolumn{1}{c|}{2} &
%   1 &
%   \multicolumn{1}{c|}{6} &
%   0 \\ \cline{4-11} 
%  &
%   &
%   1 &
%   \multicolumn{1}{c|}{9} &
%   23467 &
%   \multicolumn{1}{c|}{13} &
%   25896 &
%   \multicolumn{1}{c|}{11} &
%   27896 &
%   \multicolumn{1}{c|}{13} &
%   27086 \\ \cline{1-1} \cline{3-11} 
% \multirow{2}{*}{LL} &
%   &
%   0 &
%   \multicolumn{1}{c|}{0} &
%   0 &
%   \multicolumn{1}{c|}{0} &
%   0 &
%   \multicolumn{1}{c|}{0} &
%   0 &
%   \multicolumn{1}{c|}{0} &
%   0 \\ \cline{3-11} 
%  &
%   &
%   1 &
%   \multicolumn{1}{c|}{4} &
%   3319 &
%   \multicolumn{1}{c|}{1} &
%   3439 &
%   \multicolumn{1}{c|}{1} &
%   3680 &
%   \multicolumn{1}{c|}{0} &
%   3700 \\
%   \cline{1-11}
% \end{tabular}
% \end{table}

% \begin{figure}[h]
%     \centering
%     \includegraphics[width=\linewidth]{fig/pneumonia.PNG}
%     \caption{Pneumonia performance gain per annotated data}
%     \label{fig:pneumonia}
% \end{figure}