\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{booktabs}
\usepackage{natbib}
\usepackage{float}
\usepackage{graphicx}
%\usepackage{subcaption}
%\usepackage{todonotes}
\usepackage{chngcntr}
\usepackage{multirow}
\usepackage{mwe} % to get dummy images
%\usepackage{caption}
\usepackage{hyperref}
\usepackage[T1]{fontenc}
\usepackage{enumitem}


%\jmlrvolume{-- Under Review}
\jmlryear{2020}
\jmlrworkshop{Full Paper -- MIDL 2020}
%\editors{A for MIDL 2020}

%\def\blind#1{REMOVED}
\def\blind#1{#1}


\usepackage{fixfoot}
\DeclareFixedFootnote*{\githuburl}{\url{\blind{https://github.com/mlmed/torchxrayvision}}}

\title[X-ray cross-domain generalization]{On the limits of cross-domain \\ generalization in automated X-ray prediction}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{%
\Name{Joseph Paul Cohen} \Email{joseph@josephpcohen.com}\\
\Name{Mohammad Hashir} \Email{mkhan31@vols.utk.edu}\\
\Name{Rupert Brooks} \Email{rupert.brooks@nuance.com}\\
\Name{Hadrien Bertrand} \Email{hadrien.bertrand@mila.quebec}\\
\addr Mila, Universit\'{e} de Montr\'{e}al
}

% Color definitions from: http://latexcolor.com/
\definecolor{alizarin}{rgb}{0.82, 0.1, 0.26}
\definecolor{azure}{rgb}{0.0, 0.5, 1.0}

% Hyperref setup
\hypersetup{
    linktoc=all,
    linktocpage=true,
    colorlinks=true,
    citecolor=azure,
    linkcolor=azure,
    urlcolor=alizarin
}

\begin{document}

\maketitle

\begin{abstract}
This large-scale study quantifies which X-ray diagnostic prediction tasks generalize well across multiple different datasets. We present evidence that the issue of generalization is not due to a shift in the images but instead a shift in the labels. We study the cross-domain performance, agreement between models, and model representations. We find interesting discrepancies between performance and agreement: some models achieve good performance yet disagree in their predictions, while others agree yet achieve poor performance. We also test for concept similarity by regularizing a network to group tasks across multiple datasets together and observe variation across the tasks. All code is made available online and data is publicly available: \url{https://github.com/mlmed/torchxrayvision}


\end{abstract}

\begin{keywords}
X-ray diagnostics, deep learning, generalization
\end{keywords}

\section{Introduction}

This work studies the generalization performance of current chest X-ray prediction models when trained and tested on X-ray image datasets from different institutions that were annotated by different clinicians or labelling tools. By doing so, we aim to provide supporting evidence for which tasks are reliable and consistent across multiple different datasets.
Indeed, there appear to be limits to the performance of systems designed to replicate humans, which is consistent with the evidence that human radiologists often do not agree with each other. Recent research has discussed generalization issues \cite{Pooch2019, Yao2019op, Baltruschat2019}; however, it is not clear exactly what the cause of the problem is. We enumerate some possibilities:
\begin{itemize}[nosep]
%    \itemsep0em 
    \item Errors in labelling as discussed by \citet{Oakden-Rayner2019} and \citet{Majkowska2019}, in part due to automatic labellers.
    \item Discrepancies among the radiologist's, the clinician's, and the automatic labeller's understanding of a radiology report \cite{Brady2012}.
    \item Bias in clinical practice between doctors \cite{Busby2018} or limitations in objectivity \cite{Cockshott1983, Garland1949}.
    \item Interobserver variability \cite{Moncada2011}. It can be related to the medical culture, language, textbooks, or politics, and possibly even to the concepts themselves (e.g. footballs between USA and the world \includegraphics[height=10pt]{imgs/football-157930_640.png}\includegraphics[height=8pt]{imgs/american-football-311817_640.png}).
\end{itemize}

Formally we have pairs of X-ray images, $x_i$, and corresponding task labels, $y_i$, drawn from some joint distribution $p(x,y)$ for a given population.  Our learning methods estimate $p(y|x)$, but may not generalize well when the joint distribution changes due to, for example, different X-ray machines or variable patient characteristics between different populations.  There are several different cases that can give rise to variations in $p(x,y)$ and we will use the terminology of \cite{Moreno-Torres2012} to describe them.  Approaches for generalizing medical image models (e.g. \cite{Pooch2019}) have assumed $p(y|x)$ to be constant and concentrated on {\em covariate shift} (where $p(x)$ varies) and {\em prior probability shift} (where $p(y)$ varies).  We present evidence that $p(y|x)$ is not consistent and that what is considered the ``ground truth'' is subjective, which is {\em concept shift} in the terminology of \cite{Moreno-Torres2012}. This forces us to consider $p(y|x,c)$ where $c$ conditions the prediction. Our experiments suggest that this conditioning is not only related to bias from the population but is also due to other factors. This presents a new challenge to overcome when developing diagnostic systems as, under the current formulation, it may be impossible to train a system that will generalize.

To address this issue, \citet{Majkowska2019} relabeled a subset of the NIH dataset images for 4 labels using 3 raters. On these images their raters did not agree with each other for ``Airspace opacity'' 10\% of the time and for ``Nodule/mass'' 6\% of the time\footnote{We calculate these statistics from the published file individual\_readers.csv. Anything short of unanimous agreement between the 3 raters is counted as disagreement.}. When looking at NIH images which have been used in other datasets and relabelled for the same pathologies (Appendix Figure \ref{fig:nih-label-agreement}), we find generally poor agreement with NIH labels for positive predictions, with F1 scores as low as 10\% (for Pneumonia).  The Kaggle and Google relabelings show better, but still far from perfect, agreement on the one category where they overlap (Opacity, F1: 73\%).

When creating the MIMIC-CXR dataset, \citet{Johnson2019mimic-cxr} used two different automatic label extraction methods. Between these methods the largest disagreement was 0.6\% for ``Fracture'' (when only considering positive and negative labels) or 2.6\% for ``Cardiomegaly'' (when including uncertain and no prediction as well). They also evaluated a subset of the radiology reports with a board-certified radiologist and found a lowest agreement of 0.462 F1 for ``Enlarged Cardiomediastinum'', which can possibly be explained by uncertainty about what cardio-thoracic ratio (CTR) is clinically relevant \cite{Zaman2007}. 

These studies indicate that automatic labelling tools are consistent with each other and that the issue is likely related to the well-known problem of interobserver variability. As a step towards mitigating this problem, we study its impact on current deep learning approaches.


\textbf{Our approach:} In this work we analyze models trained on four of the largest public datasets, utilizing over 200k unique chest X-rays after filtering for one AP or PA view per patient. A study like this is needed as these systems are being built and evaluated now \cite{Cohen2019, Qin2019, Baltruschat2019, Hwang2019, Rubin2018LateralMultiView, Yao2019op, Putha2018CQ100k}. This work is further motivated by the use of these models in populations very different from their training population, such as in \cite{Qin2019} where systems such as qXR (developed in India) are applied to images from Nepal and Cameroon. 

There are many issues that could prevent a model from generalizing. For example, the model may overfit to artifacts of the training data \cite{Zech2018}, concepts can vary between training labels and external data, the training data may not be a representative sample of external data, and the model could be learning very superficial image statistics \cite{Jo2017}.

%With the scale of data made available to researchers we have an opportunity to study this question quantitatively and at a large scale. 

The paper is structured into three sections: performance, agreement, and representation. The performance section (\S \ref{sec:perf}) studies the performance of models trained on one dataset and evaluated on others. The agreement section (\S \ref{sec:agree}) studies how much the predictions of models trained on one dataset agree with the predictions of models trained on other datasets for the same task. Finally, the representation section (\S \ref{sec:rep}) studies how much the representations in the neural networks differ between the models. All code is made available online\githuburl and data is publicly available.

\section{Data}


We use the following datasets: 
\textbf{NIH} aka Chest X-ray14 \cite{WangNIH2017}, 
\textbf{PC} aka PadChest  \cite{Bustos2019PadChest}, 
\textbf{CheX} aka CheXpert \cite{Irvin2019CheXpert},
\textbf{MIMIC-CXR}  \cite{Johnson2019mimic-cxr},
\textbf{OpenI} \cite{Demner-Fushman2016},
\textbf{Google}  \citep{Majkowska2019},
\textbf{Kaggle} aka the RSNA Pneumonia Detection Challenge\footnote{\url{https://www.kaggle.com/c/rsna-pneumonia-detection-challenge}}. Full details of the data are located in Appendix \S \ref{sec:datadetails_apdx}. We identified 18 common labels by manually reviewing the descriptions of the provided labels in each dataset. The exact mapping is detailed in the code provided online. We release a framework called torchxrayvision \cite{Cohen2020xrv} to load these datasets in a canonical way for further experimentation. To align the datasets we resize the images to $224 \times 224$ pixels, as is standard for methods on these datasets \citep{Rajpurkar2017chexnet}. We did not want to confuse the issue by changing the architecture and strategy from previous work. The images are also center cropped if they are not square (so as not to stretch the images) and the pixel values are scaled to $[-1024, 1024]$ so that the bit depth of the images is uniform.

\section{Models}

\label{sec:models}

DenseNets \cite{Huang2017} have been shown to be the best architecture for X-ray predictive models \cite{Rajpurkar2017chexnet}. Training followed standard practice in similar work. Because only some labels are present in the recent 2019+ datasets, the loss is computed only for the available labels and the other outputs are ignored.  An ensemble of three models is trained for each dataset and the results are averaged to reduce noise.

Due to label imbalance, tasks which are underrepresented receive less focus from the loss function. To alleviate this, the weight for each task is balanced based on the frequency of that task in the dataset. Each task $t$ is given a weight $w_t$ based on the following formula, where $c_t$ is the count of positive samples for task $t$ and $\bar{c}$ is the average count over tasks. The intuition here is that $\max_i(c_i) - c_t$ will be 0 for at least one task, so $\bar{c}$ pushes up the minimum weight while $\frac{\alpha_t}{\max_i(\alpha_i)}$ normalizes this value to be between 0 and 1.
\begin{equation}
w_t = \frac{\alpha_t}{\max_i(\alpha_i)}
,\hspace{20pt}
\alpha_t = \max_i(c_i) - c_t + \bar{c} 
\end{equation}
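As an illustration with hypothetical counts (not taken from any of the datasets used here), consider three tasks with $c = (100, 400, 1000)$ positive samples, so that $\max_i(c_i) = 1000$ and $\bar{c} = 500$:
\begin{align*}
\alpha &= (1000 - 100 + 500,\; 1000 - 400 + 500,\; 1000 - 1000 + 500) = (1400,\, 1100,\, 500), \\
w &= \left(\tfrac{1400}{1400},\, \tfrac{1100}{1400},\, \tfrac{500}{1400}\right) \approx (1.00,\, 0.79,\, 0.36).
\end{align*}
The rarest task receives the full weight of 1, while the most frequent task keeps a nonzero weight of $\bar{c}/\max_i(\alpha_i)$ rather than being ignored.
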
In order to calibrate the outputs of the models so that they can be compared, a piecewise linear transformation (Eq. \ref{eq:normalize}) is applied. The transformation is chosen so that the best operating point corresponds to $50\%$. For each disease, we compute the optimal operating point by maximizing the difference (\textit{True positive rate} - \textit{False positive rate}), which corresponds to the threshold that maximizes the informedness of the classifier \cite{powersEvaluation2011}. This is computed with respect to the test set being evaluated so that each model operates at its best possible threshold. With this we remove miscalibration as a reason for generalization error.
\begin{equation}
\label{eq:normalize}
f_{\mathrm{opt}}(x) = \begin{cases} 
      \frac{x}{2\,\mathrm{opt}} & x\leq \mathrm{opt} \\
      1-\frac{1-x}{2(1-\mathrm{opt})} & \text{otherwise}
   \end{cases}
\end{equation}

It is important to note that Eq. \ref{eq:normalize} requires an operating point which is not the same across all datasets. %If calibrating the model only on the test set corresponding to that data there is a 1.2\% decrease in performance on average and up to a 30\% decrease in performance for labels such as Pneumonia, Infiltration, Lung Lesion, Fracture, and Enlarged Cardiomediastinum. 
For example, if we calibrate on NIH data so that a prediction of 0.5 is the optimal decision boundary (FPR/TPR tradeoff) for that dataset and then apply the model to PadChest, the optimal decision boundary will be different, possibly 0.8. This means a prediction of 0.79 should be considered negative, and the model should be calibrated using Eq. \ref{eq:normalize} so that 0.8 $\rightarrow$ 0.5.
%This echoes the models uncertainty and a need to overcome this challenge for deployment to avoid false positives/negatives. 
In this study we calibrate each model based on the test data it is evaluating at that moment in order to remove this issue from consideration and assume the model is operating with optimal calibration.  Each of the models in the ensemble is calibrated separately, and their calibrated output is averaged.
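
To make this concrete, suppose (hypothetically) that the optimal operating point on the target test set is 0.8. Writing $f_{0.8}$ for Eq. \ref{eq:normalize} with that operating point, raw outputs map as
\begin{align*}
f_{0.8}(0.79) &= \frac{0.79}{2 \times 0.8} \approx 0.494, \\
f_{0.8}(0.8) &= \frac{0.8}{2 \times 0.8} = 0.5, \\
f_{0.8}(0.9) &= 1 - \frac{1 - 0.9}{2 \times (1 - 0.8)} = 0.75,
\end{align*}
so a raw prediction of 0.79 falls just below the calibrated threshold of 0.5, while the endpoints 0 and 1 are left unchanged. The transformation is monotonic, so it changes the thresholded decisions but not the ranking of predictions, and therefore leaves the AUC of each individual model unchanged.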

Data augmentation was used to improve generalization. Following the best results in \citet{Cohen2019} (and replicated by us), each image was rotated up to 45 degrees, translated up to 15\%, and scaled larger or smaller by up to 10\%.


\section{Performance}

\label{sec:perf}

\begin{figure}[t]
    \centering
    \includegraphics[width=1\textwidth]{imgs/auc-all.pdf}
\caption{\small AUC of each model on each dataset. All valid combinations of model and dataset are computed where a model was trained on the specific label and that label exists in the target dataset. A white cell means it cannot be computed due to missing labels in train or test dataset. The outputs of 3 models are averaged together to reduce noise. Each of the 3 models is trained on the same data with different weight initialization. }
     \label{fig:auc}
\end{figure}

The most basic analysis to evaluate how well models generalize is to look at the performance outside their training data. In Figure \ref{fig:auc} a model is trained on each dataset's training subset and then evaluated on the other datasets' test subsets. AUC is used to determine the performance per task as it accounts for imbalance in labels. Many combinations are not possible because the datasets do not overlap completely and we aim to include as many labels as possible. 

The experiments show the best generalization for the tasks Cardiomegaly, Edema, and Effusion. It also seems there is reasonable generalization for Atelectasis, Consolidation, Emphysema, Hernia, and Lung Opacity.
The worst generalization performance can be seen for Infiltration, where it is inverted between the PC and NIH datasets. For Pneumonia, the NIH model performs poorly and the other models also perform poorly on NIH while performing well on other datasets. For Fracture, all models perform very poorly on the hand-labelled NIH\_Google dataset and perform much better on their own test set than on others. For Pneumothorax, models also perform better on their own test set than on others, but they do perform well on the hand-labelled NIH\_Google dataset.





% \begin{table}
%     \centering
%     \begin{tabular}{c c c c}
%     \toprule
% Domain & Training domain only & All other domains & All domains  \\
% \midrule
% NIH& .76$\pm$0.01 & .76$\pm$0.01 & .80$\pm$0.01 \\
% PC & .85$\pm$0.01 & .78$\pm$0.01 & .86$\pm$0.01\\
% CheX   & .80$\pm$0.01 & .75$\pm$0.01 & .81$\pm$0.01\\
% MIMIC\_CH   & .83$\pm$0.01 & .79$\pm$0.01 & .86$\pm$0.02\\
%     \bottomrule
%     \end{tabular}
%     \caption{\small A leave one domain out evaluation. Here `` Training domain only'' represents a model trained on only the training data of the domain under test. ``All other domains'' represents a model trained on all domains except the domain under test. Finally ``All domains'' represents a model trained on all domains including the domain under test. The average AUC over all tasks are shown. Three seeds are used to initialize the model and the mean and stdev is shown.}
%     \label{tab:crossdomain}
% \end{table}

In Figure \ref{fig:auc} the ``All'' model which is trained on all datasets combined outperforms almost all other models (with the exception of Pneumonia on NIH). However, this result is not due to better generalization but is due to the inclusion of the training data which comes from the same domain. To verify this, similar to \citet{Yao2019op}, the performance on a test set is evaluated in Figure \ref{fig:crossdomain} when leaving the test set domain out of training.



\begin{figure}[h]
    \centering
    \includegraphics[width=1.0\textwidth]{imgs/cross-domain-auc.png}
\caption{\small A leave-one-domain-out evaluation. The blue bar represents a model trained on only the training data of the domain under test. The orange bar represents a model trained on all domains except the domain under test. Finally, the green bar represents a model trained on all domains including the domain under test. The average AUC over all tasks is shown. Three seeds are used to initialize the models and the mean and standard deviation are shown.}
     \label{fig:crossdomain}
\end{figure}

\begin{figure}[h]
    \centering
    \includegraphics[width=1.0\textwidth]{imgs/kappa-all.png}
    \caption{\small Kappa inter-rater variability for pairwise comparisons given each test set. A white cell means it cannot be computed due to missing labels in the train or test dataset. An ensemble of 3 models is used to reduce noise. For each task the Kappa score is calculated between the model trained on that data and other models which are trained to predict that task. }
    \label{fig:kappa}
\end{figure}

%\newpage
\section{Agreement}
\label{sec:agree}

We use Cohen's Kappa score \cite{Cohen1960} to calculate the agreement between raters, or in our case networks trained on a specific dataset: $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance. A Kappa of 0 indicates only chance agreement and 1 indicates total agreement.  A Kappa of 40\% is considered moderate while 70\% is considered excellent \cite{Moncada2011}.
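
As a worked illustration with hypothetical numbers, suppose two models are thresholded on the same 100 test images: both predict positive on 20, both predict negative on 60, and each predicts positive alone on 10. The observed agreement is $p_o = (20 + 60)/100 = 0.8$. Each model predicts positive on 30\% of the images, so the agreement expected by chance is $p_e = 0.3 \times 0.3 + 0.7 \times 0.7 = 0.58$, giving
\[
\kappa = \frac{0.8 - 0.58}{1 - 0.58} \approx 0.52 .
\]
Although the two models agree on 80\% of the images, their Kappa is only in the moderate range because most of that agreement is expected from the shared base rate of negatives.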

In Figure \ref{fig:kappa} agreement is poor for labels which are not common. 
%MIMIC\_CH and MIMIC\_NB are a special case for agreement as the images are the same but the ground truth labels are extracted using different methods. 
Agreement between the models themselves when trained with different seeds is between 75\% and 86\% Kappa (Appendix Figure \ref{fig:seedkappa:self}), indicating that this is the upper bound. The tasks are ordered from left to right based on their generalization performance as evaluated in \S \ref{sec:perf}. An unexpected finding is that the models generally agree on Cardiomegaly, with the exception of the NIH model, which nevertheless seems to perform well except on the MIMIC\_CH dataset. These results are concerning as models can disagree yet still perform well. 

For some tasks, models disagree yet achieve high AUC, while for others they agree strongly yet achieve low AUC. The outputs for two such model-task combinations are studied in Figure \ref{fig:ba}. 


\begin{figure}[]
\floatconts
  {fig:ba}% label for whole figure
  {\caption{\small Bland-Altman plots showing agreement of the model outputs. The red line indicates where optimal agreement should be. The model outputs are calibrated so that 0.5 is the operating point of the AUC and is therefore the optimal threshold. This calibration causes the diamond artifacts when plotted.}}
  {      
  \subfigure[Good performance but poor agreement]{%
      \label{fig:ba:hernia}%
      \includegraphics[width=0.49\textwidth]{imgs/ba-pc-nih-hernia.png}
    }%
  \subfigure[Bad performance but high agreement]{%
      \label{fig:ba:nodule}%
      \includegraphics[width=0.49\textwidth]{imgs/ba-pc-nih-nodule.png}
    }
  }
\end{figure}

\section{Representation}
\label{sec:rep}

\begin{figure}[t]
    \centering
\includegraphics[width=1.0\textwidth]{imgs/weight-scatter-merge.pdf}
    % \includegraphics[width=1\textwidth]{imgs/weight_scatter.png}
    % \includegraphics[width=1\textwidth]{imgs/weight_scatter_leg.png}
    % \includegraphics[width=1\textwidth]{imgs/weight_scatter_reg.png}
    \caption{\small Two models trained so that each output represents a single dataset-task combination resulting in a weight vector for each. A PCA of these weight vectors is shown. The left shows the normal training case while the right shows the result when trained with regularization so that all vectors for the same task are similar (L2 distance). }
    \label{fig:weight_scatter}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[width=0.95\textwidth]{imgs/weight_bar.png}
    \caption{\small Distances between regularized and non-regularized models. Tasks are sorted by their distance while regularized. The average and standard deviation of 3 models trained with different seeds and data splits is shown.}
    \label{fig:weight_bar}
\end{figure}

We can also look at how the representation in a model changes between training datasets. In this experiment we train a network in which each output represents a single dataset-task combination, resulting in a weight vector for each (5 datasets $\times$ 18 tasks = 90 outputs). With this approach each image is processed into a feature vector of dimension 1024. A classifier layer is applied to these feature vectors, where each task output is determined by a sigmoid (logistic function). These vectors are updated in order to improve a single task and are therefore independent of each other (ignoring transfer learning via multitask training). During training each vector is updated only with respect to its own dataset. If these weight vectors are the same between two tasks then their predictions will be identical. Because the logistic function is a relatively linear transformation, the distance between these vectors is meaningful and can explain similarity between tasks.
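
In symbols, and under notation introduced here only for illustration, let $h(x) \in \mathbb{R}^{1024}$ be the feature vector of image $x$ and let $w_{d,t}$ be the weight vector for dataset $d$ and task $t$. Assuming a standard linear classifier layer with a bias term $b_{d,t}$ (the bias is an assumption), each output can be written as
\[
\hat{y}_{d,t}(x) = \sigma\!\left(w_{d,t}^{\top} h(x) + b_{d,t}\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} .
\]
Under this form, if $w_{d,t} = w_{d',t}$ and $b_{d,t} = b_{d',t}$ for two datasets $d$ and $d'$, the two outputs coincide for every image.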



In Figure \ref{fig:weight_scatter} the first 2 components of a principal component analysis (PCA) \cite{Pearson1901} are plotted for every domain-task vector. PCA is a linear dimensionality reduction method, so distances in the plot are meaningful, unlike with t-SNE \cite{VanDerMaaten2009b}. We can observe some similarity of the tasks such as Cardiomegaly and Effusion but generally the vectors are very different. 

We can add an L2 regularizer that encourages the weight vectors of the same task to be close to each other. This is added to the objective function so the model is simultaneously learning to make predictions while it is trying to align these weight vectors. The results of training with this regularization are also shown in Figure \ref{fig:weight_scatter}. We can see that even with this pressure to align weight vectors, some tasks do not merge into a single vector, such as Mass, Nodule, Fibrosis, Lung Lesion, and Pleural Thickening. 
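
One way such a regularizer could be written, as a sketch rather than the exact form used in the released code, is as a sum of squared L2 distances between the weight vectors $w_{d,t}$ of the same task $t$ across datasets $d$:
\[
\mathcal{L} = \mathcal{L}_{\mathrm{pred}} + \lambda \sum_{t} \sum_{d < d'} \left\lVert w_{d,t} - w_{d',t} \right\rVert_2^2 ,
\]
where $\mathcal{L}_{\mathrm{pred}}$ is the prediction loss and $\lambda$ controls the strength of the alignment. The pairwise form and the symbol $\lambda$ are assumptions made for illustration.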

The more variation between these task vectors the more evidence that for the same feature vector a different prediction must be made. This implies that the differences between the datasets during training have caused the network to diverge in its representation of a task and produce different results. These differences are viewed explicitly in Figure \ref{fig:weight_bar} where the differences between weight vectors have been averaged over 3 seeds and normalized relative to the other tasks. 


\section{Discussion}

This work presents evidence that the community may want to focus on concept shift over covariate shift in order to improve generalization of chest X-ray prediction models. If only covariate shift were present, we would not expect over half of the tasks to perform well while the remainder have highly variable results. Our results, specifically the discrepancy between model prediction agreement and performance, raise more questions that warrant further study. 

In order to address this problem it seems that better automatic labeling may not be the solution as the bias is likely at the level of different schools of thought, general disagreement between radiologists, and subjectivity in what is clinically relevant to include in a report.

If these networks are anything like doctors then discrepancy, difference of opinion, and errors are unavoidable \cite{Siegle1998, Brady2012, Brady2017, Soffa2004}. As these models are only trained to capture the conditional distribution defined by the training distributions, they will carry with them the bias of the data. When building these into tools which influence clinical outcomes, we should not assume that model predictions reflect our own idea of a medical concept. We should consider each task prediction as defined by its training data, such as ``NIH Pneumonia''.  One can present the output of multiple models to a user with information about the specific context and origin of each model.

We assert that training on local data from the hospital in which the tool will be deployed is not a solution. We have shown that even though a model trained using all datasets performs well, this does not reflect true generalization performance. It follows that we should not be fine-tuning models on local distributions, as doing so likely only adapts to the local biases in the data, which may not match the reality in the images.

\section{Limitations}

Only labels associated with each dataset are used and the outcomes of the patients are not considered. This would be relevant for establishing the risk of disagreement for specific tasks. We only use the AP/PA views and ignore the lateral views, which may contain needed features of a finding as discussed by \citet{Bertrand2019lateralviews} and \citet{Hashir2020Lateral}.

%\newpage

% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{We thank Mathieu Germain, Chin-Wei Huang, Karsten Roth, Joseph Viviano, Lan Dao, Ronald Summers, Yoshua Bengio, Gabriele Prato, and Michaël Chassé for their feedback. We also specifically thank Chin-Wei Huang for help selecting perfect icons for Figure 5. This work utilized the supercomputing facilities managed by Compute Canada and Calcul Quebec. We thank AcademicTorrents.com for making data available for our research.}

\bibliography{cohen20}

\newpage
\appendix
\counterwithin{figure}{section}
\counterwithin{table}{section}
%\section{}
\renewcommand\thefigure{\thesection.\arabic{figure}}   

\section{Data details}
\label{sec:datadetails_apdx}

\begin{itemize}

\item \textbf{NIH} (30k images) From the Clinical Center, Bethesda, Maryland, USA. The Chest X-ray14 dataset released by the NIH \cite{WangNIH2017}. It was automatically labelled using the NegBio labeller. 

\item \noindent\textbf{PC} (62k images) Aka PadChest, from Hospital San Juan de Alicante, Alicante, Spain \cite{Bustos2019PadChest}. The images have been labeled with various types of radiological findings and differential diagnoses, with 27\% of the annotations created manually by physicians and the rest extracted from the report by an RNN. Since the PadChest dataset defines a hierarchy of labels, we mapped the labels to their respective top level. 


\item \noindent\textbf{CheX} (64k images) Aka CheXpert, from the Stanford Hospital, Palo Alto, CA, USA \cite{Irvin2019CheXpert}. This dataset introduces a custom labeller called the ``CheXpert labeler''.

\item \noindent\textbf{MIMIC-CXR} (45k images) From the Beth Israel Deaconess Medical Center in Boston, MA, USA \cite{Johnson2019mimic-cxr}. Labels were extracted and are provided from two automatic labellers, both the CheXpert and the NIH NegBio labeller. MIMIC\_CH refers to the CheXpert labeller and MIMIC\_NB refers to the NIH NegBio labeller.

\item \noindent\textbf{OpenI} (3267 images) From the Indiana University hospital network \cite{Demner-Fushman2016}. The MeSH automatic labeller was used.

\item \noindent\textbf{Google} (1695 images) Images from the NIH data were relabeled manually \citep{Majkowska2019} for 4 labels. We don't use the ``mass/nodule'' label as it does not align with our standardization of labels.

\item \noindent\textbf{Kaggle} (30227 images) From the Kaggle Pneumonia Detection Challenge\footnote{\url{https://www.kaggle.com/c/rsna-pneumonia-detection-challenge}}. Each image was hand labelled by a single radiologist for the presence of lung opacity. This label is included as both Lung Opacity and Pneumonia.
\end{itemize}

Detailed dataset information is in Appendix Table \ref{tab:counts}.

\begin{table}[h]
    \centering
\resizebox{\columnwidth}{!}{%
\begin{tabular}{ccccccccc}
\toprule
Dataset &         NIH &          PC &         CheX &    Google &    MIMIC\_CH &    MIMIC\_NB &     OpenI &      Kaggle \\
\midrule
Atelectasis                &  1702/29103 &  2441/59674 &  12691/14317 &         - &  4077/30954 &  4048/32058 &  271/2996 &           - \\
Cardiomegaly               &   767/30038 &  5390/56725 &   9099/17765 &         - &  3743/32312 &  3275/33431 &  185/3082 &           - \\
Consolidation              &   427/30378 &   494/61621 &   5390/22504 &         - &   816/32297 &   762/33564 &         - &           - \\
Edema                      &    82/30723 &   108/62007 &  14929/20615 &         - &  1157/33610 &  1121/34731 &   50/3217 &           - \\
Effusion                   &  1280/29525 &  1637/60478 &  20640/23500 &         - &  3713/33401 &  3595/34489 &  120/3147 &           - \\
Emphysema                  &   265/30540 &   546/61569 &            - &         - &           - &           - &   84/3183 &           - \\
Enlarged Cardio &           - &           - &   5181/20506 &         - &   692/31505 &   660/32641 &         - &           - \\
Fibrosis                   &   571/30234 &   341/61774 &            - &         - &           - &           - &   17/3250 &           - \\
Fracture                   &           - &  1665/60450 &   4250/14948 &   60/1635 &   972/30961 &   696/32320 &   78/3189 &           - \\
Hernia                     &    83/30722 &   988/61127 &            - &         - &           - &           - &   41/3226 &           - \\
Infiltration               &  3604/27201 &  4438/57677 &            - &         - &           - &           - &   66/3201 &           - \\
Lung Lesion                &           - &           - &   4217/14422 &         - &  1321/31033 &  1271/32187 &    3/3264 &           - \\
Lung Opacity               &           - &           - &  30873/15675 &  601/1094 &  5426/31175 &  5301/32371 &  327/2940 &  9555/20672 \\
Mass                       &  1280/29525 &   507/61608 &            - &         - &           - &           - &    6/3261 &           - \\
Nodule                     &  1661/29144 &  2194/59921 &            - &         - &           - &           - &   68/3199 &           - \\
Pleural\_Thickening         &   763/30042 &  2076/60039 &            - &         - &           - &           - &   30/3237 &           - \\
Pneumonia                  &   168/30637 &  2051/60064 &   2822/14793 &         - &  2176/33347 &  2042/34479 &   68/3199 &  9555/20672 \\
Pneumothorax               &   269/30536 &    98/62017 &   4311/32685 &   72/1623 &   560/33651 &   500/34760 &   14/3253 &           - \\
\bottomrule
\end{tabular}
}
    \caption{Counts of samples in each dataset, shown as positive/negative. Some datasets omit labels for some samples, while others provide an explicit negative value for every sample.}
    \label{tab:counts}
\end{table}

\begin{figure}
\floatconts
  {fig:nih-label-agreement}% label for whole figure
  {\caption{Label agreement between different datasets which use NIH images. Samples from the NIH dataset were relabelled in the Kaggle and Google datasets. The Google dataset explicitly lists the corresponding NIH image, while the Kaggle dataset could be rematched based on pixel similarity. This figure shows the confusion matrices for images which were labelled by two of the datasets.}}
  {  
    \subfigure[Pneumonia \hspace{20px} F1:10\%]{%
      \includegraphics[width=0.25\textwidth]{imgs/pneumonia-kaggle-nih.png}
    }%
    \subfigure[Lung Opacity \newline F1:73\%]{%
      \includegraphics[width=0.25\textwidth]{imgs/opacity-kaggle-google.png}
    }%
    \subfigure[Pneumothorax \newline F1:45\%]{%
      \includegraphics[width=0.25\textwidth]{imgs/pneumothorax-nih-google.png}
    }%
    \subfigure[Nodule/Mass \newline F1:48\%]{%
      \includegraphics[width=0.25\textwidth]{imgs/nodulemass-nih-google.png}
    }%
  }
\end{figure}

\begin{figure}[h]
    \centering
    \includegraphics[width=1\textwidth]{imgs/kappa-self.png}
    \caption{Kappa inter-rater variability for pairwise comparisons of each model across the 3 seeds.}
    \label{fig:seedkappa:self}
\end{figure}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.6\textwidth]{imgs/openi-pa.png}
    \caption{t-SNE of features extracted from OpenI images, used to distinguish PA-view images from lateral-view images.}
    \label{fig:openi-pa}
\end{figure}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.7\textwidth]{imgs/vectors-auc.png}
    \caption{\small Relationship between generalization performance and similarity between weight vectors.}
    \label{fig:vectors-auc}
\end{figure}


\begin{figure}[h]
\floatconts
  {fig:example-infiltration}% label for whole figure
  {\caption{Images with the strongest disagreement for the label Infiltration. Left: NIH model predicts higher. Right: PC model predicts higher. The top row shows images from the NIH dataset and the bottom row shows images from PC. All images are labelled as Infiltration for their respective dataset. The probability output by each model is shown below the image. The outputs are calibrated so that 50\% is the operating point for each model.}}
  {  \subfigure[{\footnotesize{NIH-Label:True \newline NIH:61\%,PC:42\%
     \newline \scalebox{.5}{00021361\_000.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/00021361_000.png}
    }%
    \subfigure[{\footnotesize{NIH-Label:True \newline NIH:60\%,PC:30\%
     \newline \scalebox{.5}{00000044\_000.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/00000044_000.png} %12227
    }%\hspace{20pt}
    \subfigure[{\footnotesize{NIH-Label:False \newline NIH:16\%,PC:69\%
     \newline \scalebox{.5}{00002997\_000.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/00002997_000.png}
    }%
    \subfigure[{\footnotesize{NIH-Label:False \newline NIH:35\%,PC:85\%
     \newline \scalebox{.5}{00009259\_000.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/00009259_000.png}
    }%
    %\vspace{20px}
    
    \subfigure[{\footnotesize{PC-Label:False \newline NIH:63\%,PC:2\%
     \newline \scalebox{.3}{216840111366964013402131755672012186105344221\_01-068-021.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/216840111366964013402131755672012186105344221_01-068-021.png} %19968
    }%
    \subfigure[{\footnotesize{PC-Label:False \newline NIH:61\%,PC:8\%
     \newline \scalebox{.3}{216840111366964012487858717522009226103118712\_00-003-139.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/216840111366964012487858717522009226103118712_00-003-139.png}
    }%\hspace{20pt}
    \subfigure[{\footnotesize{PC-Label:True \newline NIH:24\%,PC:56\%
     \newline \scalebox{.3}{36164007903935514481435557937719104961\_yx8eii.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/36164007903935514481435557937719104961_yx8eii.png}
    }%
    \subfigure[{\footnotesize{PC-Label:True \newline NIH:22\%,PC:83\%
     \newline \scalebox{.3}{92102097721666959716631439717880101953\_jrv54v.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/92102097721666959716631439717880101953_jrv54v.png} % 2935
    }%
  }
\end{figure}


\begin{figure}[]
\floatconts
  {fig:example-hernia}% label for whole figure
  {\caption{Images with the strongest disagreement for Hernia in the PC and NIH datasets. Left: NIH model predicts higher. Right: PC model predicts higher. The probability output by each model is shown below the image. The outputs are calibrated so that 50\% is the operating point for each model.}}
  {  \subfigure[{\footnotesize{NIH-Label:True \newline NIH:51\%,PC:17\%
     \newline \scalebox{.5}{00007887\_000.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/00007887_000.png} % 10540
    }%
    \subfigure[{\footnotesize{NIH-Label:True \newline NIH:50\%,PC:9\%
     \newline \scalebox{.5}{00014005\_000.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/00014005_000.png} %8667
    }%\hspace{20pt}
    \subfigure[{\footnotesize{NIH-Label:False \newline NIH:21\%,PC:82\%
     \newline \scalebox{.5}{00029875\_000.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/00029875_000.png} % 5618
    }%
    \subfigure[{\footnotesize{NIH-Label:False \newline NIH:27\%,PC:93\%
     \newline \scalebox{.5}{00017162\_000.png}}}]{%
      \includegraphics[width=0.22\textwidth]{xrays/00017162_000.png} % 14880
    }%
    %\vspace{20px}
    
    \subfigure[{\footnotesize{PC-Label:False \newline NIH:50\%,PC:39\%
     \newline \scalebox{.3}{216840111366964013076187734852011228185752527\_00-112-176.png}} }]{%
      \includegraphics[width=0.22\textwidth]{xrays/216840111366964013076187734852011228185752527_00-112-176.png} %19968
    }%
    \subfigure[{\footnotesize{PC-Label:False \newline NIH:50\%,PC:28\%
     \newline \scalebox{.3}{216840111366964013686042548532013164084535605\_02-099-093.png}} }]{%
      \includegraphics[width=0.22\textwidth]{xrays/216840111366964013686042548532013164084535605_02-099-093.png}
    }%\hspace{20pt}
    \subfigure[{\footnotesize{PC-Label:True \newline NIH:52\%,PC:90\%
     \newline \scalebox{.3}{216840111366964013590140476722013042113955260\_02-067-187.png}} }]{%
      \includegraphics[width=0.22\textwidth]{xrays/216840111366964013590140476722013042113955260_02-067-187.png}
    }%
    \subfigure[{\footnotesize{PC-Label:True \newline NIH:50\%,PC:90\%
     \newline \scalebox{.3}{216840111366964014008416513202014183130928200\_01-164-038.png}} }]{%
      \includegraphics[width=0.22\textwidth]{xrays/216840111366964014008416513202014183130928200_01-164-038.png} % 2935
    }%
  }
\end{figure}


\begin{figure}[]
\floatconts
  {fig:example-fracture}% label for whole figure
  {\caption{Images with the strongest disagreement for Fracture in the MIMIC-CXR dataset. All images have the ground truth labelling CH-Label:True and NB-Label:False. Left: NB model predicts higher. Right: CH model predicts higher. The probability output by each model is shown below the image. CH indicates the probability output by the model trained using the CheXpert labels, and NB indicates the probability output by the model trained using the NegBio labels. The outputs are calibrated so that 50\% is the operating point for each model. There was only one sample where the NB label was true and the CH label was false; it is not shown, as both networks strongly predicted a negative score.}}
  {  \subfigure[\footnotesize{CH:12\%,NB:51\% 
     \newline \scalebox{.35}{3f2e38d3-07b10866-3e3d01da-6c343983-5f755c74.jpg} }]{%
      \includegraphics[width=0.23\textwidth]{xrays/3f2e38d3-07b10866-3e3d01da-6c343983-5f755c74.jpg}
    }%
    \subfigure[\footnotesize{CH:14\%,NB:51\% 
    \newline \scalebox{.35}{2cf717d2-9314f52f-734d718b-2a600f6c-3ed5999a.jpg}} ]{% 3118
      \includegraphics[width=0.23\textwidth]{xrays/2cf717d2-9314f52f-734d718b-2a600f6c-3ed5999a.jpg} %3008
    }%\hspace{2pt}
    \subfigure[\footnotesize{CH:50\%,NB:6\%
    \newline \scalebox{.35}{0365ada9-cd9764cf-a538faad-6f9f3f35-976e50a3.jpg}} ]{% 
      \includegraphics[width=0.23\textwidth]{xrays/0365ada9-cd9764cf-a538faad-6f9f3f35-976e50a3.jpg}
    }%
    \subfigure[\footnotesize{CH:50\%,NB:8\%
    \newline \scalebox{.35}{574d988c-fc8b10ea-ed2a4270-9e90be70-dcd274a8.jpg}} ]{% 
      \includegraphics[width=0.23\textwidth]{xrays/574d988c-fc8b10ea-ed2a4270-9e90be70-dcd274a8.jpg} % 13858
    }%
 
  }
\end{figure}

\end{document}
