\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution









% My Packages
\usepackage{algpseudocode}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{array}
\usepackage{booktabs}       % professional-quality tables
%\usepackage{cite}
\usepackage{enumitem}
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts\
\usepackage{graphicx}
\usepackage{hyperref}       % hyperlinks
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage{microtype}      % microtypography
%\usepackage{movie15}
%\usepackage[numbers,sort&compress]{natbib}
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
%\usepackage{subcaption}
\usepackage{textcomp}
\usepackage{url}            % simple URL typesetting
%\usepackage[dvipsnames]{xcolor}

\usepackage{listings}
\usepackage{color}
\usepackage{pifont}
\newcommand{\cmark}{\ding{51}}%
\newcommand{\xmark}{\ding{55}}%

\newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{C}[1]{>{\centering\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}
\newcolumntype{R}[1]{>{\raggedleft\let\newline\\\arraybackslash\hspace{0pt}}m{#1}}

\setlength{\abovedisplayskip}{1pt}
\setlength{\belowdisplayskip}{1pt}

\usepackage{tikz}
\def\checkmark{\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;}
%%%%%%










\usepackage{mwe} % to get dummy images
\jmlrvolume{-- Under Review}
\jmlryear{2020}
\jmlrworkshop{Full Paper -- MIDL 2020}
\editors{Under Review for MIDL 2020}

\title[Robustness to FN Annotations]{Brain Metastasis Segmentation Network Trained with Robustness to Annotations with Multiple False Negatives}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Darvin Yi\nametag{$^{1}$}} \Email{darvinyi@stanford.edu}\\
\addr $^{1}$ Department of Biomedical Data Science at Stanford University, Stanford, CA 94305 USA \AND
\Name{Endre Gr{\o}vik\nametag{$^{2}$}} \Email{endre.grovik@mn.uio.no}\\
\addr $^{2}$ Department for Diagnostic Physics at Oslo University Hospital, Oslo, Norway\AND
\Name{Michael Iv\nametag{$^{3}$}} \Email{miv@stanford.edu}\\
\addr $^{3}$ Department of Radiology at Stanford University\AND
\Name{Elizabeth Tong\nametag{$^{3}$}} \Email{etong@stanford.edu}\AND
\Name{Greg Zaharchuk\nametag{$^{3}$}} \Email{gregz@stanford.edu}\AND
\Name{Daniel Rubin\nametag{$^{1,3}$}} \Email{rubin@stanford.edu}
}

\begin{document}

\maketitle

\begin{abstract}
Deep learning has proven to be an essential tool for medical image analysis.  However, the need for accurately labeled input data, often requiring time- and labor-intensive annotation by experts, is a major limitation to the use of deep learning.  One solution to this challenge is to allow for use of coarse or noisy labels, which could permit more efficient and scalable labeling of images.  In this work, we develop a lopsided loss function based on entropy regularization that assumes the existence of a nontrivial false negative rate in the target annotations.  Starting with a carefully annotated brain metastasis lesion dataset, we simulate data with false negatives by (1) randomly censoring the annotated lesions and (2) systematically censoring the smallest lesions.  The latter better models true physician error because smaller lesions are harder to notice than larger ones.  Even with a simulated false negative rate as high as 50\%, applying our loss function to randomly censored data preserves maximum sensitivity at 97\% of the baseline with uncensored training data, compared to just 10\% for a standard loss function.  For the size-based censorship, maximum sensitivity is restored from 17\% of the baseline with the current standard to 88\% with our lopsided bootstrap loss.  Our work will enable more efficient scaling of the image labeling process, complementing other approaches that create more efficient user interfaces and tools for annotation.
\end{abstract}

\begin{keywords}
Brain Metastasis, Segmentation, Deep Learning, False Negative, Noisy Label
\end{keywords}























\section{Introduction}\label{sec:Introduction}

In recent years, deep learning has advanced many areas of computer vision, such as image classification \cite{krizhevsky2012imagenet,simonyan2014very,szegedy2016rethinking,he2016deep,huang2017densely}, object detection \cite{ren2015faster,he2017mask,redmon2017yolo9000,lin2017focal}, and semantic segmentation \cite{long2015fully,ronneberger2015u,badrinarayanan2017segnet,he2015spatial,chen2017deeplab,chen2017rethinking}.  This has also led to an explosion in high-profile applications of deep learning in medical image analysis \cite{ting2017development,esteva2017dermatologist}.  For classification-level labels, datasets in medical imaging are starting to reach sizes comparable to those in general computer vision; examples include CheXNet \cite{rajpurkar2017chexnet}, MURA \cite{rajpurkar2017mura}, and the DREAM Digital Mammography Challenge. However, datasets for dense prediction tasks like segmentation remain limited, comprising at most hundreds of patients \cite{heller2019kits19,bakas2018identifying,pedrosa2019lndb,bilic2019liver}.

Current approaches to facilitate efficient generation of labeled data include improving management and sharing of expert annotations on medical images \cite{rubin2008medical,moreira20153d} and mining data automatically from the picture archiving and communication system (PACS) \cite{yan2018deeplesion,shin2015interleaved}. However, these methods generally involve processing data on physician interactions in a given database system, often requiring specialized code unique to each institution or database manufacturer.  Others have examined the potential for crowdsourcing labels \cite{albarqouni2016aggnet,irshad2014crowdsourcing}, as done for many computer vision datasets, but medical images often require extensive domain knowledge to achieve accurate annotations.  In this work, we focus not on increasing the quality or quantity of available data but rather on improving methods for learning from these data.  Specifically, we develop a network that is highly robust to noisy labels, especially false negatives (FNs), to reduce the requirements for large amounts of fine and dense annotations.

We build on the work of \citet{grovik2020deep}, using brain metastasis as a model system to evaluate the impact of FN annotations in a segmentation task.  Previous work in this area, such as \citet{lu2016learning}, showed that weakly supervised learning involving super-pixel alignment could help fix noisy boundaries between labels, but such methods would not be adequate for cases where full segments were not labeled, as when an entire lesion was missed by an annotator.  The weak label learner (WELL) \cite{sun2010multi} and multi-label with missing labels (MLML) \cite{wu2015multi} frameworks can deal with missing labels but not fully misclassified annotations.  Our work builds primarily on the foundation of \citet{reed2014training}, which trains a network on noisy (misclassified) labels.  We expand the method from classification to segmentation and also modify the loss function under our assumption that FNs, not false positives (FPs), are the main source of noise.

We present the following novel contributions:


\begin{enumerate}
    \item We have developed, to our knowledge, the first segmentation network dealing with whole-lesion FN labels.
    
    \item We have created a ``lopsided bootstrap loss'' that assumes prevalence of annotations with FNs and recovers performance through entropy regularization.
    
    \item We have demonstrated that this method preserves performance for exceptionally high induced FN rates (as much as 50\%), where FN lesions are chosen either at random or based on size.
\end{enumerate}






























\section{Data}\label{sec:Data}

Our dataset, introduced in \citet{grovik2020deep}, comprises 156 patients examined at Stanford Hospital with known brain metastases and no prior treatment (surgery or radiation).  It is split 100/5/51 for training/validation/test.  The use of the validation set is described in Section~\ref{sec:implementation}.  The test set was chosen to contain 17 patients in each of three groups (1--3 lesions, 4--10 lesions, and more than 10 lesions) to ensure that our model was not biased toward cases with numerous or sparse lesions.  For each patient, four MR pulse sequences are available: pre- and post-contrast T1-weighted 3D fast spin echo (CUBE), post-Gd T1-weighted 3D axial inversion recovery prepped fast spoiled gradient-echo (BRAVO), and 3D fluid-attenuated inversion recovery (FLAIR).  The FLAIR as well as pre- and post-contrast CUBE series were co-registered to the post-contrast BRAVO series using the nordicICE software package (Nordic Neuro Lab, Bergen, Norway).  The annotations were also made on the BRAVO series.

The metastasis lesions were labeled by two neuroradiologists with a combined experience of 13 years.  Some annotations were made by a fourth-year medical student and then edited by the aforementioned neuroradiologists.  The labeling was done on a professional version of OsiriX \cite{rosset2004osirix} using the polygon tool, which requires the annotator to click once per vertex per slice per lesion.  The average 3D lesion required 45 clicks to fully annotate.  The annotation time depended on the number and size of lesions and was highly variable, ranging from about 1 minute for patients with a single large (greater than 1~cm in diameter) lesion to 3 hours for patients with over a hundred smaller lesions.  To maximize the use of physician time for gathering more data, no case was read by more than one reader.  This design choice leaves our study's reference standard vulnerable to error.





















\section{Methods}\label{sec:Methods}

\subsection{Lesion Censoring}\label{sec:censoring}

We simulate FNs by performing lesion censoring, i.e., deleting lesions from the expert annotations.  Each annotation target is a binary mask comprising one or more 3D connected components (CCs), and censoring a lesion means deleting its CC by replacing all of its 1s with 0s.  %These annotation targets comprise one or more 3D connected components (CC) in the form of a binary mask.  Censoring a lesion means deleting a particular CC, or replacing all 1s with 0s.  Since the choice of which 3D CCs to censor is important, we evaluate two methods for lesion censoring: stochastic censoring and size-based censoring.


\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:lesionCensoring}
  {\caption{ \textbf{A graphical example of the lesion censoring process.}  \small{The example shown here illustrates lesion censoring with a 50\% stochastic censoring rate.}}}
  {\includegraphics[width=0.4\textwidth]{figures/simulation.png}}
\end{figure}


\textbf{Stochastic Censoring.}  In stochastic censoring, we specify a rate $p$ of FNs to induce in the labeled data.  Each lesion (or 3D CC) is then independently censored with probability $p$, yielding an expected overall FN rate of $p$ over all lesions in the training data.  Since censoring is stochastic, there are exponentially many possible combinations ($2^n$, where $n$ is the number of lesions) of censored vs. retained lesions.%  This method is unlikely to accurately model physician bias in annotating lesions, as clinician errors are not expected to follow a uniform random distribution.

\textbf{Size-Based Censoring.}  To better model clinician errors, we introduce size-based censoring.  In this method, we systematically censor the smallest lesions (by volume) across all patients to achieve the desired FN rate of $p$, such that proportion $p$ of the total number of lesions have been removed.%  Though the cut-off is quite harsh, this is a much better approximation for clinicians’ error.
This method generates the set of censored vs. retained lesions deterministically. While it better approximates physician error than does stochastic censoring, it does not capture other sources of error, such as lesions that are subtle due to lack of contrast with healthy tissue rather than small size.

For both stochastic and size-based censoring, we use a very harsh FN rate of 50\%.  While this rate may be too high to accurately simulate clinician error, this choice allows us to better evaluate the performance limit of our methods, as a method robust to such a high FN rate will likely accommodate the lower rate of typical clinical applications.
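As a rough illustration of the two censoring schemes, and not a description of the exact implementation, let $n$ denote the number of annotated lesions in the training data and $K$ the number of censored lesions ($K$ is a symbol introduced here only for illustration).  Under stochastic censoring, each lesion is removed independently with probability $p$, so
\begin{equation*}
K \sim \mathrm{Binomial}(n,p), \qquad \mathbb{E}\left[\frac{K}{n}\right] = p, \qquad \mathrm{sd}\left(\frac{K}{n}\right) = \sqrt{\frac{p(1-p)}{n}} .
\end{equation*}
The realized FN rate therefore matches $p$ only in expectation; for $p = 0.5$ its standard deviation is $1/(2\sqrt{n})$, e.g., $0.05$ for a hypothetical $n = 100$.  Under size-based censoring, if the lesion volumes are sorted as $v_{(1)} \le \dots \le v_{(n)}$, the censored set consists of the $k$ smallest lesions with $k \approx pn$; how ties and non-integer values of $pn$ are resolved is left open in this sketch.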





\subsection{Bootstrap Loss}\label{sec:bootstrap}

Here, we define the different loss functions with which we will train our network.%  In particular, we discuss how using loss functions other than the standard softmax cross-entropy can make networks more robust to input annotations with FNs.

\textbf{Class-Based Loss Weighting.} The simplest approach to the FN problem is to introduce an additional weighting for positive cases.  Since we consider the limit in which FNs represent the main source of annotation error, we can attempt to improve performance by upweighting the loss for incorrectly classifying the positive cases, as defined in Equation~\ref{eq:classBased}.


\begin{equation}\label{eq:classBased}
\mathcal{L} (Y,\hat{Y}) = \left\{\begin{array}{lc}
CE(Y,\hat{Y})&\text{if } Y = 0\\
\alpha * CE(Y,\hat{Y})&\text{if } Y = 1
\end{array}
\right.
\end{equation}

Specifically, we define a weighting parameter $\alpha$ which describes the multiplicative factor for the positive cases.  For example, with $\alpha = 10$, positive case pixels will receive $10\times$ the weight of negative case pixels.
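To make Equation~\ref{eq:classBased} concrete, one can write it per voxel.  This is a sketch under the assumption that $CE$ denotes the usual binary cross entropy, with $y \in \{0,1\}$ the target label of a voxel and $\hat{y} \in (0,1)$ its predicted lesion probability (the lowercase symbols are introduced here for illustration):
\begin{equation*}
\ell(y,\hat{y}) = -\alpha\, y \log \hat{y} - (1-y)\log(1-\hat{y}) .
\end{equation*}
With $\alpha = 1$ this is the unweighted cross entropy.  With $\alpha = 10$ and the natural logarithm, an annotated lesion voxel predicted at $\hat{y} = 0.5$ contributes $10 \log 2 \approx 6.93$, whereas a normal voxel with the same prediction contributes $\log 2 \approx 0.69$.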

\textbf{Bootstrap Loss.}  Following \citet{reed2014training}, our work focuses on the bootstrap loss, as given in Equation~\ref{eq:bootstrap}. % Similarly, we can see that the bootstrap loss is a weighted average between our normal cross entropy loss with a second cross-entropy loss.  This secondary cross-entropy loss is between our predicted probabilities $\hat{Y}$ and our predicted classification one-hot encodings, $\text{argmax} (\hat{Y})$. 
This loss is a weighted average between the common cross entropy (CE) loss and a secondary CE loss between our predicted probabilities, $\hat{Y}$, and our predicted classification one-hot encodings, $\text{argmax} (\hat{Y})$.  This creates a feedback loop, where we push predictions to further increase the ``winning'' logit value and lower the other losing values.  Thus, this is a form of entropy minimization, and the bootstrap loss is a regularization of the entropy of our predicted probability distribution.


\begin{equation}\label{eq:bootstrap}
\mathcal{L}(Y,\hat{Y})= \beta * CE(Y,\hat{Y})+ (1-\beta)*CE(\text{argmax}(\hat{Y}),\hat{Y})
\end{equation}

We define a parameter $\beta \in (0,1]$, which defines the proportion of loss represented by the classical cross entropy with the remainder being our bootstrap cross entropy.  For example, $\beta = 1$ represents the baseline case where we only use the standard cross entropy.  $\beta = 0.1$ would represent 10\% of our loss coming from the CE between our predictions and our (potentially noisy) target annotations while 90\% of our loss comes from the feedback loop of the bootstrap loss component.
A diagram of the bootstrap loss can be found in Figure~\ref{fig:diagram}.%  At every iteration of training, the loss is shared between treating the noisy annotations and our own binarized predictions as the target.


\begin{figure}[htbp]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:diagram}
  {\caption{\textbf{Diagram of Bootstrap Loss}  \small{At every iteration of training, the loss is shared between treating the noisy annotations and our own binarized predictions as the target.}}}
  {\includegraphics[width=0.6\textwidth]{figures/diagram.png}}
\end{figure}
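The entropy-minimizing effect of the second term in Equation~\ref{eq:bootstrap} can be seen in a per-voxel sketch.  We assume here, for illustration only, a binary problem in which $\hat{y} \in (0,1)$ is the predicted lesion probability of a voxel, so that $\text{argmax}(\hat{Y})$ selects whichever class has the larger predicted probability:
\begin{equation*}
CE(\text{argmax}(\hat{Y}),\hat{Y}) = -\log \max(\hat{y},\, 1-\hat{y}) .
\end{equation*}
This quantity is largest ($\log 2$) at $\hat{y} = 0.5$ and approaches zero as $\hat{y}$ approaches 0 or 1, as does the binary entropy of the prediction.  For example, with the natural logarithm it equals about $0.51$ at $\hat{y} = 0.6$ and about $0.11$ at $\hat{y} = 0.9$, so minimizing it rewards more confident predictions in whichever direction the network already leans.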


\textbf{Lopsided Bootstrap Loss.} Given our assumption that errors are predominantly FNs, we can further improve performance by introducing a lopsided bootstrap loss that also incorporates class-based loss weighting.  Taking our noisy target labels, we can separate the loss into two cases: (1) the target is positive and (2) the target is negative.  When the target is positive, we will weight the loss by our $\alpha$ factor, as in the class-based loss weighting.  In the case where the target is negative, we will apply the bootstrap loss with parameter $\beta$.  With $\beta = 1$, this loss simply reduces to class-based loss weighting where positive cases are upweighted by $\alpha$.


\begin{equation}\label{eq:lopsided}
\mathcal{L} (Y,\hat{Y}) = \left\{\begin{array}{lc}
\beta * CE(Y,\hat{Y})+ (1-\beta)*CE(\text{argmax}(\hat{Y}),\hat{Y})&\text{if } Y = 0\\
\alpha * CE(Y,\hat{Y})&\text{if } Y = 1
\end{array}
\right.
\end{equation}

Thus, we introduce two hyperparameters $(\alpha,\beta)$ into our model search pipeline.  To limit the number of models we would have to train and evaluate, we chose roughly logarithmically spaced values: $\alpha \in \{3,10,30\}$ and $\beta \in \{1, 0.5, 0.1\}$.
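As a worked illustration of Equation~\ref{eq:lopsided}, consider a voxel that belongs to a censored lesion, so that its noisy target is $Y = 0$, and suppose the network nonetheless predicts a lesion probability of $\hat{y} = 0.9$.  Assuming a per-voxel binary cross entropy with the natural logarithm ($\hat{y}$ is a symbol introduced here for illustration), the loss for this voxel is
\begin{equation*}
\beta \left[-\log(1-\hat{y})\right] + (1-\beta)\left[-\log \max(\hat{y},\, 1-\hat{y})\right] .
\end{equation*}
For $\beta = 1$ this is $-\log 0.1 \approx 2.30$, whereas for $\beta = 0.1$ it is $0.1 \times 2.30 + 0.9 \times 0.105 \approx 0.33$.  The penalty for confidently predicting a lesion where the annotation says normal is thus reduced roughly sevenfold, while the loss on annotated lesion voxels ($Y = 1$) is unchanged.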




\subsection{Implementation Details}\label{sec:implementation}

We follow a methodology very similar to prior work on metastasis segmentation from \citet{yi2019mri} and \citet{grovik2019handling}.  We use DeepLabv3 \cite{chen2017rethinking} as the core segmentation network, but our lopsided bootstrap loss is generalizable to any architecture.  Our network is a 2.5D network: for each frame, five z-slices (centered on the slice being predicted) from each of the four pulse sequences are stacked along the channel dimension.  The input tensor for each frame therefore has size $256 \times 256 \times 20$.

All code was written using PyTorch \cite{paszke2017automatic}.  Networks were trained on a commercial-grade workstation with two NVIDIA 1080Ti GPUs.  Networks were left to train for about 10 epochs, or approximately 24 hours.  The network runs at about 100 ms per frame, corresponding to a runtime of about 30 s per 3D MR scan.  The network with the best performance on the validation set was chosen for testing.  The validation set had the same lesion censorship as the training set, whether we had no censorship, stochastic censorship, or size-based censorship.  The test set had no censorship and used all lesions of the original annotation set.

The baseline model as reported in \citet{yi2019mri} underwent an extensive hyperparameter search to ensure that the baseline approach had the strongest possible model.  All other models used the same L2 parameter, learning rate, and annealing rate as the baseline model.  It may be possible to optimize each proposed model further.


\subsection{Metrics}\label{sec:metrics}

We report three main metrics for our networks' performance: (1) mean average precision (mAP) with respect to detection of brain lesions, (2) maximum sensitivity, and (3) segmentation DICE score of the true positives at maximum sensitivity.  To derive these metrics, we must define a true positive (TP).  We first binarize our 3D segmentation probability maps with probability threshold 10\% and calculate the centroid of each predicted 3D CC.  If the centroid is within 1 mm of any ground truth annotation, that 3D CC is treated as a TP annotation.  The predicted confidence of each 3D CC is calculated as the average predicted probability from the original probability map of each voxel in the 3D CC.  With the list of centroids and confidences, we create a precision-recall (PR) curve, the area under which is the mAP value.  95\% confidence intervals (CIs) are reported for mAP scores using the method described by Hanley and McNeil \cite{hanley1983method}.  Allowing all 3D CCs after binarization to be predictions, we can also calculate our maximum sensitivity.  We finally report the TP DICE scores as a segmentation metric for the lesions we do predict correctly.
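These quantities can be written out explicitly.  The following uses the standard definitions and introduces notation for illustration only: $c$ is a predicted 3D CC with voxel set $V_c$, $\hat{y}_v$ is the predicted probability at voxel $v$, $P$ and $G$ are assumed to be the predicted and ground truth voxel sets of a TP lesion, and TP, FP, and FN count lesions.
\begin{align*}
\text{confidence}(c) &= \frac{1}{|V_c|}\sum_{v \in V_c} \hat{y}_v, &
\text{DICE}(P,G) &= \frac{2\,|P \cap G|}{|P| + |G|},\\
\text{sensitivity} &= \frac{\text{TP}}{\text{TP}+\text{FN}}, &
\text{precision} &= \frac{\text{TP}}{\text{TP}+\text{FP}}.
\end{align*}
Sweeping a threshold over the CC confidences traces out the PR curve, and the maximum sensitivity corresponds to the loosest threshold, at which every CC that survives the 10\% binarization counts as a prediction.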

























\section{Results}\label{sec:results}

\subsection{Stochastic Lesion Censoring}\label{sec:stochasticCensoring}

Table \ref{tab:randomCens} shows metrics for training on stochastically censored data.  With simple class-based loss weighting ($\beta = 1$), the maximum sensitivity falls to 10\% of that for the baseline without censoring.  After incorporating the lopsided bootstrap loss function ($\beta = 0.5$ or $\beta = 0.1$), this performance is largely recovered, up to 97\% of the baseline.  We also tested the network with $\alpha=30$, which resulted in predictions of over 99\% probability for every voxel regardless of the corresponding $\beta$ value.


\begin{table}[htbp]
\floatconts
  {tab:randomCens}%
  {\caption{Stochastic Lesion Censoring with FN Rate 50\%}}%
  {\begin{tabular}{lllll}
  \bfseries Training Data & \bfseries Loss $(\alpha,\beta)$ & \bfseries mAP (95\% CI) & \bfseries Max Sensitivity & \bfseries TP DICE\\
  \midrule
  \midrule
  Full & 3, 1 & 46 (44,47) & 80 & 72\\
  \midrule
  \nicefrac12 Censored Data & 3, 1 & 20 (15,22) & 8 & 54\\
  \nicefrac12 Censored Data & 10, 1 & 6 (2,9) & 15 & 48\\
  %\nicefrac12 Censored Data & 30, 1 & 0 (0,0) & 0 & 0\\
  \midrule
  \nicefrac12 Censored Data & 3, 0.5 & 39 (36,41) & 76 & 75\\
  \nicefrac12 Censored Data & 10, 0.5 & 29 (25,32) & 53 & 69\\
  %\nicefrac12 Censored Data & 30, 0.5 & 0 (0,0) & 0 & 0\\
  \midrule
  \nicefrac12 Censored Data & 3, 0.1 & 42 (40,44) & 78 & 73\\
  \nicefrac12 Censored Data & 10, 0.1 & 35 (31,37) & 63 & 71\\
  %\nicefrac12 Censored Data & 30, 0.1 & 0 (0,0) & 0 & 0\\
  \end{tabular}}
\end{table}
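The relative sensitivities quoted in the text follow directly from the Max Sensitivity column of Table~\ref{tab:randomCens}, taking the uncensored baseline (80) as the reference:
\begin{equation*}
\frac{8}{80} = 0.10, \qquad \frac{78}{80} = 0.975 ,
\end{equation*}
for $(\alpha,\beta) = (3,1)$ and $(3,0.1)$, respectively, on the censored data.  The second ratio corresponds to the reported 97\% of baseline.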


Figure \ref{fig:entropy} shows histograms of the predicted voxel probabilities from our segmentation network on all of the test patients for our baseline model (Table \ref{tab:randomCens}, row 1), direct training on censored data (Table \ref{tab:randomCens}, row 2), and our best performing bootstrap model (Table \ref{tab:randomCens}, row 6).  The predictions have been further colored to represent the predicted probabilities of the voxels corresponding to the ground truth (GT) lesion and GT normal classes.  Thus, an ideal network would predict 0 (far left) for GT normal and 1 (far right) for GT lesion voxels.  %The predicted probability values of our baseline model (figure \ref{fig:entropyBaseline}) generally follow this pattern, with low values predicted for most GT normal voxels and high values for most GT lesion voxels.
For the baseline model (Figure \ref{fig:entropyBaseline}), renormalizing the GT lesion probability histogram and calculating its Shannon entropy gives a value of 1.74.  In stark contrast, the model trained directly on the censored data without the bootstrap loss (Figure \ref{fig:entropyCensored}) rarely predicts a high probability value, and the entropy of its lesion probabilities is 4.38.  Our bootstrap loss (Figure \ref{fig:entropyBootstrap}) recovers some of the characteristics of our baseline model despite training on the highly censored data; the entropy of its lesion probabilities is 3.14.
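For reference, the Shannon entropy of a renormalized histogram can be written as follows; the number of bins $B$, the bin counts $h_b$, and the base of the logarithm are not specified in the text and are treated as assumptions in this sketch:
\begin{equation*}
q_b = \frac{h_b}{\sum_{b'=1}^{B} h_{b'}}, \qquad H = -\sum_{b=1}^{B} q_b \log q_b , \qquad 0 \le H \le \log B ,
\end{equation*}
with the convention $0 \log 0 = 0$.  $H$ is zero when all GT lesion voxels fall in a single bin and reaches $\log B$ when they are spread uniformly across bins, so lower values indicate more concentrated predictions.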


\begin{figure}[htbp]
\floatconts
  {fig:entropy}
  {\caption{\textbf{Entropy of predicted probabilities.} \small{We separate the predicted probabilities of voxels corresponding to the positive (in blue) and negative (in red) class.}}}
  {%
    \subfigure[Baseline]{\label{fig:entropyBaseline}%
      \includegraphics[width=0.29\linewidth]{figures/entropy_baseline.png}}%
    \qquad
    \subfigure[Censored]{\label{fig:entropyCensored}%
      \includegraphics[width=0.29\linewidth]{figures/entropy_censored.png}}
    \qquad
    \subfigure[Bootstrap]{\label{fig:entropyBootstrap}%
      \includegraphics[width=0.29\linewidth]{figures/entropy_bootstrap.png}}
  }
\end{figure}
%

Figure \ref{fig:qualitative} shows representative images of the segmentations that result from training on the censored data.  We can see the effects of different $\alpha$ and $\beta$ hyperparameters on our segmentations in Figure \ref{fig:exampleTable}.  Increasing $\alpha$ and decreasing $\beta$ both make the network more sensitive.  However, decreasing $\beta$ better matches image-level gradients, which is advantageous for the task of segmentation.  Figure \ref{fig:exampleTotal} shows us the segmentation performance on different tiers of lesion sizes of the $(\alpha,\beta) = (3,0.1)$ network.  We can see that it performs well across a range of lesion sizes.


\begin{figure}[htbp]
\floatconts
  {fig:qualitative}
  {\caption{\textbf{Qualitative examples of segmentation results.}}}
  {%
    \subfigure[Hyperparameter Effects]{\label{fig:exampleTable}%
      \includegraphics[height=0.34\linewidth]{figures/table.png}}%
    \qquad
    \subfigure[Segmentation Examples]{\label{fig:exampleTotal}%
      \includegraphics[height=0.34\linewidth]{figures/example.png}}
  }
\end{figure}







\subsection{Size-based Lesion Censoring}\label{sec:sizeBased}

Table \ref{tab:sizeCens} shows metrics regarding our size-based lesion censoring experiments.  As with stochastic censoring, training on the censored data with naive class-based loss weighting reduces performance drastically, with maximum sensitivity falling to 17\% of that from the baseline with no censoring.  We again demonstrate that the lopsided bootstrap loss restores performance to 88\% of baseline.  Similar to the experiments with random censorship, choosing $\alpha = 30$ abolished network performance.

%
\begin{table}[htbp]
 % The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
  {tab:sizeCens}%
  {\caption{Size-Based Lesion Censoring with FN Rate 50\%}}%
  {\begin{tabular}{lllll}
  \bfseries Training Data & \bfseries Loss $(\alpha,\beta)$ & \bfseries mAP (95\% CI) & \bfseries Max Sensitivity & \bfseries TP DICE\\
  \midrule
  \midrule
  Full & 3, 1 & 46 (44,47) & 80 & 72\\
  \midrule
  \nicefrac12 Censored Data & 3, 1 & 22 (19,24) & 14 & 61\\
  \nicefrac12 Censored Data & 10, 1 & 18 (14,20) & 9 & 51\\
  %\nicefrac12 Censored Data & 30, 1 & 0 (0,0) & 0 & 0\\
  \midrule
  \nicefrac12 Censored Data & 3, 0.5 & 32 (19,34) & 68 & 71\\
  \nicefrac12 Censored Data & 10, 0.5 & 18 (15,20) & 50 & 68\\
  %\nicefrac12 Censored Data & 30, 0.5 & 0 (0,0) & 0 & 0\\
  \midrule
  \nicefrac12 Censored Data & 3, 0.1 & 39 (37,41) & 71 & 71\\
  \nicefrac12 Censored Data & 10, 0.1 & 19 (17,21) & 51 & 69\\
  %\nicefrac12 Censored Data & 30, 0.1 & 0 (0,0) & 0 & 0\\
  \end{tabular}}
\end{table}
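As with stochastic censoring, the relative sensitivities quoted in the text can be read off the Max Sensitivity column of Table~\ref{tab:sizeCens}, with the uncensored baseline (80) as the reference:
\begin{equation*}
\frac{14}{80} = 0.175, \qquad \frac{71}{80} = 0.8875 ,
\end{equation*}
for $(\alpha,\beta) = (3,1)$ and $(3,0.1)$, respectively.  These correspond approximately to the reported 17\% and 88\% of baseline.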
%

Figure \ref{fig:sizehist} shows the performance with respect to lesion size of our bootstrap model trained on data with no, random, or size-based censorship.  The network trained on randomly censored data achieves better performance on smaller lesions than the network trained on size-censored data.  Indeed, the latter misses all lesions with diameter less than 4.8 mm.


\begin{figure}[htbp]
\floatconts
  {fig:sizehist}
  {\caption{\textbf{Prediction accuracy sorted by lesion size.}}}
  {%
    \subfigure[Target Annotations]{\label{fig:histGT}%
      \includegraphics[width=0.29\linewidth]{figures/sizeHist_gtExperiments.png}}%
    \qquad
    \subfigure[Random Censoring]{\label{fig:histRandom}%
      \includegraphics[width=0.29\linewidth]{figures/sizeHist_randCens.png}}
    \qquad
    \subfigure[Size-based Censoring]{\label{fig:histSize}%
      \includegraphics[width=0.29\linewidth]{figures/sizeHist_sizeCens.png}}
  }
\end{figure}
%





\subsection{Relationship to Patient Count}\label{sec:ptcount}

We also investigated how performance varies with dataset size by training the network on randomly subsampled patient cohorts.%  We compared the performance of the baseline network on non-censored data to that of the networks trained with bootstrap loss on data with stochastic or size-based censoring, as shown in table \ref{tab:ptComp}.
Training on 100 patients with an FN rate as high as 50\% achieves performance comparable to training on 30 patients with fine labels (Table \ref{tab:ptComp}).  Likewise, training on 30 patients with noisy labels performs similarly to training on 10 patients with fine labels.


\begin{table}[htbp]
 % The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
  {tab:ptComp}%
  {\caption{Comparison of Performance with Respect to Patient Count}}%
  {\begin{tabular}{llllll}
  \bfseries Tr. Data & \bfseries Pt Count & \bfseries $(\alpha,\beta)$ & \bfseries mAP (95\% CI) & \bfseries Max Sens. & \bfseries TP DICE\\
  \midrule
  \midrule
  Full & 10 & 3, 1 & 25 (21,27) & 44 & 63\\
  Full & 30 & 3, 1 & 39 (37,41) & 70 & 72\\
  Full & 100 & 3, 1 & 46 (44,47) & 80 & 72\\
  \midrule
  Stochastic & 30 & 3, 0.1 & 30 (27,32) & 54 & 69\\
  Stochastic & 100 & 3, 0.1 & 42 (40,44) & 78 & 73\\
  \midrule
  Size-Based & 30 & 3, 0.1 & 25 (21,27) & 48 & 69\\
  Size-Based & 100 & 3, 0.1 & 39 (37,41) & 71 & 71\\
  \end{tabular}}
\end{table}















\section{Discussion}\label{sec:discussion}

\textbf{Using the lopsided bootstrap loss preserves performance when training on censored data.} From Tables \ref{tab:randomCens} and \ref{tab:sizeCens}, we see that inducing a high FN rate, such as 50\%, drastically reduces performance regardless of censorship model.  Because of the massive voxel imbalance between the positive and negative cases, we already upweight positive pixels by a factor of 3.  However, as we increase the $\alpha$ value further, we start to see a decline in overall performance, as seen in the mAP values.%  This is consistent among both types of censorship.  Upon visualizing the results of the higher $\alpha$ values, we see that the networks have become so sensitive that the segmentation boundaries become more inaccurate, inducing a centroid shift that causes some of our original true positives to become false positives.
The lopsided bootstrap loss is a better way to improve performance, as seen quantitatively in the aforementioned tables as well as qualitatively in the examples of Figure \ref{fig:exampleTable}.

\textbf{The lopsided bootstrap loss can be seen as a form of entropy regularization on the positive lesion probabilities.}  This can be seen most clearly by looking at the predicted probability values on our test set, as shown in Figure \ref{fig:entropy}.  Our baseline model shows the prediction behavior typical of most networks: predicting primarily high or low (but not intermediate) probabilities.  When we censor our data,
%we see the behavior shown in Figure \ref{fig:entropyCensored}.  Since
voxels of similar lesions may be annotated as either negative (normal) or positive (lesion), so the predicted probabilities for the GT lesion voxels (Fig. \ref{fig:entropyCensored}) are distributed more evenly from 0 to 1, resulting in a much higher entropy.  Our bootstrap loss
%, especially when we apply it only when the GT label is negative,
creates a lopsided entropy regularization
%that minimizes the entropy with respect to positive case lesions.  This is done
by creating a positive feedback loop
%, such that even slightly positive or negative predictions tend to become strongly positive (1) or negative (0) respectively
that reinforces the network's own predictions.  Because this loop can also reinforce incorrect predictions, this property can be detrimental, so networks should always be tested on datasets with minimal label noise where possible.
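
For reference, the entropy discussed above can be made concrete with the standard binary entropy of a predicted voxel probability $p$ (the symbol $H$ and the use of base-2 logarithms are choices made here for illustration):
\begin{equation*}
H(p) = -p\log_2 p - (1-p)\log_2(1-p).
\end{equation*}
This gives $H(0.5) = 1$ bit, $H(0.9) \approx 0.47$ bits, and $H(0.99) \approx 0.08$ bits, so predictions spread evenly between 0 and 1 carry far more entropy than predictions concentrated near the extremes.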

\textbf{Using the bootstrap loss cannot fully eliminate size-based biases.}  Though performance is largely recovered in our size-based censorship experiments (Table \ref{tab:sizeCens}), our error profiles (Figure \ref{fig:histSize}) show that most small lesions are missed, as the training data contains annotations only for lesions larger than a certain diameter.  Interestingly, when trained with the default $(\alpha,\beta)$ values, our network performs better under size-based censorship than under stochastic censorship.  One possible reason is that size-based censorship still allows the network to learn features of larger lesions without conflicting signals, since all large lesions are labeled correctly.  However, this also means that the network actively learns that features typical of smaller lesions should be labeled as normal.  As in any application of machine learning, even an optimized loss function cannot recover signals that are not present in the training data.

\textbf{When developing deep learning applications, consideration should be given not only to the number of samples required for the desired network performance but also to the other costs of acquiring such data, such as annotator time.}
%  The dependence of network performance on dataset size is routinely evaluated, but our work demonstrates that other characteristics of the data are also important.  
Table \ref{tab:ptComp} shows that our network achieves comparable performance when trained on a larger but exceptionally noisy (FN rate of 50\%) dataset as when trained on a smaller but finely labeled dataset.  Collecting noisier data more rapidly could therefore be a beneficial tradeoff.  Another application is mixing datasets that provide labels for different diseases that are not mutually exclusive.  Our lopsided bootstrap loss function would enable training on the combined data by addressing the FNs that arise when each dataset is missing annotations for the other disease.
%To truly scale deep learning performance, we must more efficiently leverage existing data and collect new data.
Our method for addressing a high prevalence of FNs in training data enables improved utilization of noisy data and complements ongoing efforts to generate more data.

\textbf{We note that this study comes with limitations.}  Our simulated FN annotations do not fully reproduce the errors that clinicians make in practice.  Additionally, a 50\% FN rate is too high for a true clinical simulation.  However, by erring on the side of more false negatives, we hope to show the strength of our method at the cost of accurate clinical simulation.  Though the small dataset is a limitation of this study, we believe that, since learning with FN annotations presents problems independent of patient count, our main contribution (the lopsided bootstrap loss) will still be useful with larger datasets.  Finally, our original annotations were not cross-validated among multiple readers to measure inter-reader variability, which limits our understanding of the accuracy of our non-lossy target annotations.  We hope to continue to scale up our data collection process, including measuring inter-reader variability, and to validate our method on other lesion-based datasets, such as lung nodules or liver lesions.























\section{Conclusion}\label{sec:conclusion}

In this work, we have shown that the lopsided bootstrap loss can improve performance when training on a dataset whose annotations contain many false negatives.  Though the improvement is stronger when the false negatives occur at random, the loss still helps when the label noise is biased, as in our size-based censorship experiments.  We hope that by creating algorithms that are more robust to noisy data and weaker labels, we can expand the range of annotations that are usable for training deep learning networks.




















%\clearpage
% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{We thank Stanford Hospital for providing the data needed to complete this study.  We acknowledge the T15 LM 007033 NLM Training grant for funding this project.  This work was also supported in part by grants from the National Cancer Institute, National Institutes of Health, U01CA142555, 1U01CA190214, 1U01CA187947, and U01CA242879.}




\clearpage
\bibliography{yi20}

\end{document}