\RequirePackage{etoolbox}
\patchcmd{\bibliographystyle}{#1}{midl-nopagenum}{}{}

\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission
\usepackage{soul}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{colortbl}
\usepackage{hyperref}
% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\jmlryear{2020}
\jmlrworkshop{Full Paper -- MIDL 2020}

%\editors{Under Review for MIDL 2020}




\usepackage{tikz}
\def\checkmark{\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;} 

\title[]{Quantifying the Value of Lateral Views\\ in Deep Learning for Chest X-rays}



 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

\midlauthor{\Name{Mohammad Hashir\nametag{$^{1,2}$}\midljointauthortext{Contributed equally}} \Email{mohammad.hashir.khan@umontreal.ca}\\
%\AND
\Name{Hadrien Bertrand\nametag{$^{1}$}\midlotherjointauthor} \Email{hadrien.bertrand@mila.quebec}\\
%\AND
\Name{Joseph Paul Cohen}\nametag{$^{1,2}$} \Email{joseph@josephpcohen.com}\\
\addr {$^{1}$}Mila, Quebec Artificial Intelligence Institute\\
\addr {$^{2}$}Universit\'{e} de Montr\'{e}al
}

% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% Color definitions from: http://latexcolor.com/
\definecolor{alizarin}{rgb}{0.82, 0.1, 0.26}
\definecolor{azure}{rgb}{0.0, 0.5, 1.0}

% Hyperref setup
% \hypersetup{
%     linktoc=all,
%     linktocpage=true,
%     colorlinks=true,
%     citecolor=azure,
%     linkcolor=azure,
%     urlcolor=alizarin
% }

% \newcommand{\todo}[1]{\hl{TODO #1}}


\begin{document}

\maketitle

\begin{abstract}
Most deep learning models in chest X-ray prediction utilize the posteroanterior (PA) view due to the lack of other views available. PadChest is a large-scale chest X-ray dataset that has almost 200 labels and multiple views available. In this work, we use PadChest to explore multiple approaches to merging the PA and lateral views for predicting the radiological labels associated with the X-ray image. We find that different methods of merging the views utilize the lateral view differently. We also find that including the lateral view increases performance for 32 labels in the dataset, while being neutral for the others. 
The increase in overall performance is comparable to the one obtained by using only the PA view with twice the number of patients in the training set.
\end{abstract}

\begin{keywords}
convolutional neural networks, chest x-rays, lateral views, multi-label classification
\end{keywords}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
\label{sec:intro}

Large-scale public chest X-ray datasets have usually had only the posteroanterior (PA) view available (e.g. the ChestX-ray14 \citep{WangNIH2017} dataset). This has allowed the development of several convolutional neural network (CNN) based approaches that are built to use only the PA view for automatic prediction \citep{Rajpurkar2017_s,yao2017learning, Rajpurkar2018Plos, li2018thoracic, Cohen2019}. Other views, such as the lateral (L) view, are not commonly acquired as they are difficult to read without specific training \citep{Feigin2010} and are only considered useful for specific diagnoses. The lateral view, in particular, is now usually replaced by a CT scan, which is only ordered if the PA view is insufficient for a diagnosis. This practice delays any diagnosis or other actions as the patient would typically need to schedule another appointment. It also increases the risk of exposure to the larger doses of radiation used in performing CT scans.

There are specific cases in which the lateral view provides information for diagnosis that is not clear or visible on the PA view \citep{shiraishi2007computer, Feigin2010, ITTYACHEN2017257}. For example, up to 15\% of the lung can be obscured by cardiovascular structures and the diaphragm \citep{raoof2012interpretation}. It is unclear if the information is completely missing from the PA view, or if it is present but in a way that makes it too difficult for a human to read. In the first case, the lateral view is necessary for making a prediction; in the second, a sufficiently advanced model could make it redundant. 

This question was previously challenging to answer due to the lack of large-scale public datasets of paired PA and L views. However, the past year has seen the release of sizeable de-identified chest X-ray datasets containing multiple views, namely CheXpert \citep{chexpert} and MIMIC-CXR \citep{mimic-cxr} from the United States and PadChest \citep{Bustos2019-dt} from Spain. This provides us with an opportunity to explore the usefulness of the lateral view. In this work, we analyse the efficacy of multi-view models on the PadChest dataset. Preliminary work by \citet{Bertrand2019} suggests there is value in predicting from a lateral image (with a single view model) for certain radiological labels. We choose PadChest for our analysis due to the variety of radiological labels: it has a total of 194 distinct labels compared to 14 in the other two. This enables a much more fine-grained analysis of how the lateral view can contribute to chest X-ray prediction. We investigate the following questions:
\begin{itemize}
    \item \textit{Is there a benefit of also including the lateral view in a prediction model? If so, in which cases specifically?}

    \item \textit{Is there a trade-off between training on PA views of a large number of patients and training on paired PA \& L views of a smaller number of patients?}

\end{itemize}


The structure of the paper is as follows. We provide a brief overview of multi-view CNNs in radiology in \S\ref{sec:relwork} and a description of PadChest and the preprocessing we used in \S\ref{sec:data}. We present the models we used in \S\ref{sec:models}. The experiments and results are shown in \S\ref{sec:exp}, and finally we address these questions with our findings in \S\ref{sec:conclusion}.




\section{Related work}\label{sec:relwork}

Using multiple views can help in increasing detection performance in radiology. \citet{Setio2016-fq} achieved an increase of around 54\% and 70\% in baseline metrics when the number of views was increased to 3 and 9 respectively from 1 in a pulmonary nodule detection task. \citet{Shachor2019} found that their `Mixture of Views' models achieved higher metrics than single view methods for breast cancer classification and brain MRI segmentation tasks. Multi-view networks also increased performance in tasks like mammogram classification \citep{Geras2017, Carneiro2017-qk, TrentKyono2019, Nasir_Khan2019-pt}, emphysema classification \citep{Bermejo-Pelaez2018-xg}, lung segmentation \citep{El-Regaily2019-qp}, fracture detection \citep{Kitamura2019-yx} and lesion detection \citep{Li2019-wn}.

These works show a consistent gain in performance when multiple views are used, which motivates our work. The use of the lateral view in deep learning on chest X-rays has, however, been limited. \citet{Rajkomar2017-im} tried to predict whether an X-ray was a frontal or a lateral view on a dataset from California. \citet{Rubin2018-ns} is one of the initial works that assessed whether combining the frontal and lateral views would help the network. They found that combining the PA and L views led to an increase of 3\% in the average AUC over all labels and improved the performance for 12 of the 14 labels in MIMIC-CXR. 

There has not been any assessment of multi-view models on the PadChest dataset, to the best of our knowledge. While \citet{Yao2019op} performed a preliminary benchmark on a simple binary classification (normal/abnormal), it used only the PA view. \citet{Bertrand2019} observed that the lateral views did contain useful information for some prediction tasks on PadChest which is the primary motivation for our analysis on this dataset.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Data and preprocessing}
\label{sec:data}

We use the PadChest \citep{Bustos2019-dt} dataset in our analysis. It comprises 160,000 chest X-rays and reports gathered from a Spanish hospital, spanning over 67,000 patients with multiple visits and views available. The images have been annotated with various types of radiological findings and differential diagnoses.

We extract the set of patients who have a paired PA and lateral view available, forming a total of 30,699 patients. We use only the first study from every patient and discard any additional ones. We resize the images to $224 \times 224$ pixels, utilizing a center crop if the image is not square, and scale the pixel values to $[-1, 1]$ for training. Each visit can have any number of labels from the total of 194. Since the PadChest dataset defines a hierarchy of labels, we map each label to its top-level parent in order to maximize the number of images for each label. From those top-level labels, we retain only those occurring in at least 50 patients, which reduces our set of labels to 64. Some of them are of low clinical interest, such as ``electrical device''; however, they provide a sanity check on the results of the models. The number of samples per label is shown in Figure~\ref{fig:betterlateral}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Models}
\label{sec:models}



% \subsection{Models}
All models are built with four dense blocks following the same configuration as DenseNet-121 \citep{Huang2017, Rajpurkar2017_s}. Along with one type of single view model, we test four different methods of combining both views. 

\begin{enumerate}
      \setlength\itemsep{0.2ex}
    \item \textbf{DenseNet}:
    This is a standard DenseNet-121 with its fully connected layer modified for a multi-label output. It is used as a single view model and serves as the baseline. We use two DenseNets, each trained and tested on the PA and L views respectively. 
    
    \item \textbf{Stacked}:
    The architecture is essentially the same as a standard DenseNet-121 but the lateral view is stacked as a second channel of the PA image. It is the simplest way to combine the views before giving them to a model. This model is not designed to work if only one view is available.
    
    \item \textbf{DualNet} \citep{Rubin2018-ns}:
    The PA and L views are processed by two separate CNNs composed of four DenseNet blocks each. Their output feature maps are then passed through a global average pooling layer, concatenated and given to a fully connected layer that maps them to the output labels. Similar to Stacked, this model also relies on both views to be available to make a prediction.
    
    \item \textbf{HeMIS} \citep{Havaei2016}:
    The HeMIS architecture involves propagating each input view/modality through its separate set of layers, combining them by calculating pixelwise statistics and propagating further for classification. We create a HeMIS-style architecture where the PA and L views are processed by two separate CNN `branches', each composed of the first three DenseNet blocks. The mean and variance of the output feature maps of both branches are computed pixelwise, then concatenated and given to another dense block. The output of this block is passed through a global average pooling layer and given to a fully connected layer that maps it to the output classes. This model works even if only one view is available, with the variance of the feature maps substituted with a tensor of zeroes of the same shape.
    
    \item \textbf{AuxLoss}:
    We modify DualNet by adding two separate fully connected layers to the PA and L branches respectively that map it to the same output labels and train all three losses jointly. Other than concatenating the pooled vectors of the branches, we also add an option to average them. Inspired by \citet{LeeDeeplySupervised2015} (we classify the output of each branch rather than layer) and \citet{TrentKyono2019}, this could be considered as a DualNet regularized by auxiliary view-specific losses.  This model is designed to be robust to missing views; if a view is not present the prediction is calculated off the available view's fully connected layer.
    
\end{enumerate}
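
To make the HeMIS-style fusion step concrete, let $F_{\mathrm{PA}}$ and $F_{\mathrm{L}}$ denote the feature maps produced by the PA and L branches (this notation is ours and is not taken from \citet{Havaei2016}). A minimal sketch of the pixelwise statistics, assuming the population form of the variance, is
\begin{equation*}
\mu = \tfrac{1}{2}\left(F_{\mathrm{PA}} + F_{\mathrm{L}}\right), \qquad
\sigma^2 = \tfrac{1}{2}\left[\left(F_{\mathrm{PA}} - \mu\right)^2 + \left(F_{\mathrm{L}} - \mu\right)^2\right] = \tfrac{1}{4}\left(F_{\mathrm{PA}} - F_{\mathrm{L}}\right)^2,
\end{equation*}
where all operations are applied elementwise. The concatenation of $\mu$ and $\sigma^2$ is then given to the final dense block. When only one view $v$ is available, $\mu = F_v$ and $\sigma^2$ is replaced by a zero tensor of the same shape. Using the unbiased estimator instead would only rescale $\sigma^2$ by a constant factor of 2.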

\begin{figure}[h]
    \centering
    \includegraphics[width=1\linewidth]{imgs/diagrams.pdf}
    \caption{High level topology of the multi-view models used. PA and L indicate the input images while ``loss'' indicates where the prediction error is computed. Arrows indicate the computation flow during the forward pass.}
    \label{fig:diagrams}
\end{figure}

\section{Experiments}
\label{sec:exp}

We train all models for 40 epochs with Adam and a batch size of 8. Other hyperparameters differ between the models. In addition, we use curriculum learning with HeMIS and AuxLoss, denoted by the `-CL' suffix from here on. A description of our training details, including hyperparameters and how we implement curriculum learning, is given in Appendix \ref{hypersec}.

\subsection{Performance}\label{perform}
We evaluate the multi-view models using the AUC (averaged over the 64 labels) achieved in three configurations of the test set. The first is the general case where paired views are available for all patients. The other two are the cases when only a single view is available: either only the PA view is present or the L view.  For models that require both views to be available (Stacked and DualNet), we use a zero matrix in place of the missing view. With this testing regime, we want to see how the multi-view models perform with only one view relative to the respective single view model. We perform five runs of every model with a different random split every time and report the results, averaged over all runs, in \tableref{tab:results}.
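
For concreteness, the reported metric can be written as follows, where $\mathrm{AUC}_{r,k}$ denotes the test AUC of label $k$ in run $r$ (the notation is ours):
\begin{equation*}
\overline{\mathrm{AUC}}_r = \frac{1}{64} \sum_{k=1}^{64} \mathrm{AUC}_{r,k}, \qquad r = 1, \dots, 5.
\end{equation*}
We assume here that the mean and standard deviation given in \tableref{tab:results} are taken over the five values $\overline{\mathrm{AUC}}_1, \dots, \overline{\mathrm{AUC}}_5$.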

It is apparent that there is very little difference between the multi-view models and most of the models' confidence intervals overlap when both views are available. We perform t-tests between pairs of all models' performances and find that only two pairs have differences that are statistically significant. AuxLoss-CL achieves a significantly better performance than DualNet ($p=0.003$) and Stacked ($p=0.017$) but not the rest. A very interesting observation is that the Stacked model performs as well as the rest, despite the lack of a spatial correspondence between the channels of the input. 

The multi-view models diverge in performance when given only a single view. In the case of testing on only PA images of patients in the test set, DualNet emerges as a clear winner. Its confidence interval seems very close to that of Stacked, AuxLoss and AuxLoss-CL, but a t-test shows that the difference is statistically significant ($p=0.012$, $p=0.002$, $p=0.001$ respectively). The AUC of the single view DenseNet-PA lies between DualNet and Stacked, and we find that the two multi-view models are not significantly better or worse respectively. Curriculum learning also does not seem to have a discernible benefit here; in fact, HeMIS with curriculum learning is much worse than a vanilla HeMIS. 
When evaluating on only lateral images, there is a large change in the performance. While the AuxLoss models experience a 5-7\% decrease in AUC relative to paired images, the other models deteriorate by more than 20\%. The single view DenseNet-L is better than all the multi-view models in this case. 
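
As a worked check of these figures, assume the quoted percentages are relative decreases $\Delta = (\mathrm{AUC}_{\mathrm{Both}} - \mathrm{AUC}_{\mathrm{L}}) / \mathrm{AUC}_{\mathrm{Both}}$ computed from the mean values in \tableref{tab:results} (this reading and the symbol $\Delta$ are ours):
\begin{align*}
\Delta_{\text{AuxLoss-CL}} &= (0.809 - 0.771)/0.809 \approx 4.7\%, &
\Delta_{\text{AuxLoss}} &= (0.803 - 0.753)/0.803 \approx 6.2\%, \\
\Delta_{\text{HeMIS-CL}} &= (0.803 - 0.627)/0.803 \approx 21.9\%, &
\Delta_{\text{DualNet}} &= (0.801 - 0.539)/0.801 \approx 32.7\%.
\end{align*}
Under this reading, HeMIS-CL has the smallest drop among the remaining models and DualNet the largest.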


\begin{table}[!h]
    \centering
    \small
    \renewcommand{\arraystretch}{1.3} 
    \caption{\small Test AUC achieved by the different models, averaged over five runs with standard deviation also reported. DenseNet is trained on a single view denoted by the suffix, rest are trained on both views. The symbol in the superscript indicates the difference between a pair of models in the same column is statistically significant.}\label{tab:results}
    \begin{tabular}{l@{\hskip 0.7in}ccc}
    
    	\toprule
    	\small
    	\multirow{2}{*}{\textbf{Model}}  &                     \multicolumn{3}{c}{\textbf{Test AUC}}                     \\
    	                                  &             Both             &             PA             &         L         \\ \midrule
    	DenseNet-L                        &             ---              &            ---             & $0.780 \pm 0.004$ \\
    	DenseNet-PA                       &             ---              &     $0.793 \pm 0.007^{\ \ \ }$      &        ---        \\ \arrayrulecolor{black!20}\midrule
    	Stacked                           &    $0.804 \pm 0.003^{*\ }$     &     $0.786 \pm 0.009^{\ddagger\ \ }$      & $0.595 \pm 0.046$ \\
    	DualNet                           & $0.801 \pm 0.003^{\dagger\ }$  & $\mathbf{0.800 \pm 0.004}^{\dagger*\ddagger}$ & $0.539 \pm 0.018$ \\
    	HeMIS                             &      $0.803 \pm 0.006^{\ \ }$       &     $0.758 \pm 0.014^{\ \ \ }$      & $0.603 \pm 0.044$ \\
    	HeMIS-CL                          &      $0.803 \pm 0.007^{\ \ }$       &     $0.723 \pm 0.017^{\ \ \ }$      & $0.627 \pm 0.036$ \\
    	AuxLoss                           &      $0.803 \pm 0.006^{\ \ }$       &     $0.787 \pm 0.005^{*\ \ }$      & $0.753 \pm 0.002$ \\
    	AuxLoss-CL                        & $\mathbf{0.809 \pm 0.003}^{*\dagger}$ &     $0.788 \pm 0.005^{\dagger\ \ }$      & $\mathbf{0.771 \pm 0.003}$ \\
    	\arrayrulecolor{black}\bottomrule
    \end{tabular}

\end{table}

\subsection{How does the lateral view help in prediction?}
As evidenced in Section \ref{perform}, there is a major difference between the multi-view models when it comes to the view(s) they have to use for making a prediction. We wanted to examine how the AUC is affected when the proportion of patients having a paired L view in the test set is varied. We do this in \figureref{fig:lprop} by iteratively removing the lateral view for 1\% of the patients and testing the models, incrementing the percent removed until 100\%. 



The most conspicuous observation is that DualNet does not seem to use the lateral view for prediction at all. HeMIS is also unusual in that adding paired views actually hurts performance until the proportion crosses a threshold. This could be caused by the use of variance in combining the views: when HeMIS is tested on only PA images, the variance of the feature maps is substituted with a zero tensor, causing the model to believe there is zero variance. As we add patients with paired lateral views, small proportions of paired views lead to high variance, causing the dip in the curve in \figureref{fig:lprop}. Past a certain proportion, the variance of the feature maps starts approximating the population variance, which leads to increasing performance. 



The performance of the multi-view models in the three testing regimes (\tableref{tab:results}) suggests that the AuxLoss model combined with curriculum learning utilizes the lateral view most productively. We use AuxLoss-CL to find the effect of the lateral view on individual labels in the dataset: we compare the label-wise AUC of AuxLoss-CL and DenseNet-PA, averaged over all runs. If the difference between the average AUCs for a given label is less than the standard deviation of the better model, we annotate that label as indifferent to the model used. We find that 32 of the 64 labels see an improvement in AUC when both views are used jointly, as visualized in \figureref{fig:betterlateral}.
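
The label-wise criterion can be stated compactly. Let $a_k$ and $b_k$ be the AUC of label $k$ averaged over all runs for AuxLoss-CL and DenseNet-PA respectively, and let $s_k$ be the standard deviation over runs of whichever of the two models has the higher average (the symbols are ours). Then
\begin{equation*}
\text{label } k \text{ is }
\begin{cases}
\text{improved by the lateral view} & \text{if } a_k - b_k \geq s_k, \\
\text{indifferent to the model used} & \text{if } |a_k - b_k| < s_k.
\end{cases}
\end{equation*}
The text does not say how an exact tie $a_k - b_k = s_k$ is handled, so that boundary is an assumption on our part.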

\begin{figure}[htbp]
    \centering

    \includegraphics[width=1\textwidth]{imgs/lprop.pdf}

    \caption{\small Change in AUC as the proportion of patients with paired lateral views increases. Only multi-view models can be evaluated here. A value of 0.2 indicates that 20\% of the PA image samples used to test the model had a corresponding L view and the rest were treated as single images (if they had an L view in the dataset, it was ignored).}
    \label{fig:lprop}

\end{figure}

\begin{figure}[ht]
    \centering

    \includegraphics[width=1\textwidth]{imgs/fig1.pdf}

    \caption{\small The subset of the 64 labels we use in PadChest that see an improvement in AUC with AuxLoss-CL. The improvement is considered relative to the AUC achieved by the single view DenseNet-PA, which is denoted by the darker bar. The number in brackets refers to the number of samples with the label.}
    \label{fig:betterlateral}
\end{figure}





\subsection{Benefits of increasing the training set size}
 We form a second dataset by adding the remaining patients that have a PA view available; we call this the \textbf{extended} dataset. This adds 27,576 PA images from new patients which are used only for training the model. We trained the DenseNet-PA and AuxLoss-CL models on this extended dataset, and evaluated them on the main test set. Those results are reported in Table~\ref{tab:mainextended}.
 
 The DenseNet-PA model gains about two points of AUC ($0.793$ to $0.813$) from doubling the dataset size. This marginal increase in performance from a major increase in dataset size is consistent with what is observed in other computer vision tasks \citep{sun2017}. We examine how the AUC of individual labels changes between the DenseNet-PA trained on the main and extended datasets and find that, similar to AuxLoss-CL, 32 labels see an increase, with 22 of them overlapping with those improved by AuxLoss-CL. The top three increases are in the labels `azygos lobe' ($+24.1\%$), `tracheal shift' ($+12.4\%$) and `diaphragmatic eventration' ($+6.3\%$). A plot detailing all the improvements in the style of \figureref{fig:betterlateral} is included in Appendix \ref{paextbars}.
 
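Trained on the extended dataset and tested on the PA view alone, AuxLoss-CL ($0.812$) is indistinguishable in performance from the extended DenseNet-PA ($0.813$). However, it became worse at utilizing both views (an AUC of $0.772$, compared to $0.809$ when trained on the main dataset), likely due to the dominant presence of PA images. Careful over-sampling of joint and lateral images during the curriculum learning might fix that problem.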
 
\begin{table}[th]
\renewcommand{\arraystretch}{1.5}
\centering

\caption{\small AUC and standard deviation of the DenseNet-PA and AuxLoss-CL models trained on the main and extended datasets but evaluated on the same main test set. The size of the training set for the extended dataset is double that of the main. The Main AUC column copies the values from \tableref{tab:results} for easier reference.}    \renewcommand{\arraystretch}{1.3} 

\begin{tabular}{lccc}
\toprule
\small 

& & \multicolumn{2}{c}{\smaller Train Data (AUC on test reported)} \\
\textbf{Model} & \textbf{Test Data} &\textbf{Main} & \textbf{Extended} \\
\midrule
DenseNet-PA & PA & 0.793 $\pm$ 0.007 & 0.813 $\pm$ 0.005 \\
AuxLoss-CL  & PA & 0.788 $\pm$ 0.005 & 0.812 $\pm$ 0.006 \\
AuxLoss-CL  & Both & 0.809 $\pm$ 0.003  & 0.772 $\pm$ 0.018 \\       
\bottomrule
\end{tabular}


\label{tab:mainextended}
\end{table}


 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Discussion}
\label{sec:conclusion}


To return to the questions posed in the introduction, it appears a well-tuned PA-only model is competitive with a well-tuned joint model. However, for the specific labels shown in Figure~\ref{fig:betterlateral}, the lateral view allows a statistically significant improvement relative to using just the PA view. We proposed a joint model using auxiliary tasks that gives strong results with any subset of views.

While all multi-view models offer similar performance when given both views, they do not perform as well when given only one view. Testing these models on different proportions of paired L images brings several intriguing observations to the surface. First, it indicates that models other than AuxLoss rely too heavily on the PA view to make a prediction, especially DualNet, which completely ignores the L view. Second, it might explain how the Stacked model performs well in the first two testing regimes: it learns to use less information from the second channel of the input image and favours predicting from the first channel only. Third, adding curriculum learning to AuxLoss mitigates this reliance on the PA view but does not have much effect on HeMIS. 

For the second question, we found that doubling the number of PA images in the training set only gives a marginal increase in performance. A joint model trained on only half the data gives the same performance. If it is more costly to bring in a new patient than to acquire another view for an already present patient, then our results suggest acquiring both views for a smaller number of patients rather than one view on twice that number.

We also find that although the different approaches achieve similar performance with both views, training is less sensitive to hyperparameters for the AuxLoss and DualNet models, as shown in Figure~\ref{fig:hyperopt}. We therefore expect these models to be easier to train, as their range of near-optimal hyperparameters is relatively wider. 



This study has two limitations. First, while we have shown that the lateral view is useful for some labels, we did not sort those by clinical relevance. This can be seen notably by the presence of ``electrical device'' or ``catheter''. It might be that most of those labels are extremely rare, or typically detected through means other than a chest X-ray. Second, this study is based entirely on the PadChest dataset. The population differs from other available datasets, at least by ethnicity, but likely also by clinical practice. In other words, we expect both covariate shift and concept shift \citep{moreno-torres2012}. This implies that while the lateral view will still be useful for many labels, those labels might not be the same as the ones we found here. 


\begin{figure}[htbp]
    \centering

    \includegraphics[width=1\textwidth]{imgs/hyperopt_ridge.pdf}

    \caption{\small Distributions of AUC for a 40 combination hyperparameter search for each model. Some models are much more robust to hyperparameter changes than others.}

    \label{fig:hyperopt}
\end{figure}


% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{We thank Paul Bertin for his code and Mathieu Germain for discussions. We thank AcademicTorrents.com for making data available for our research. This work is partially funded by a grant from the Institut de valorisation des donnees (IVADO).  This work utilized the supercomputing facilities managed by Mila, NSERC, Compute Canada, and Calcul Quebec. We also thank NVIDIA for donating a DGX-1 computer used in this work.}


\bibliography{hashir20}


\newpage
\appendix

\section{Improvement in labels with extended dataset and paired views}\label{paextbars}

\begin{figure}[htbp]
    \centering
    \includegraphics[width=\textwidth]{imgs/paextimprove.pdf}
    \caption{\small The subset of the 64 labels we use in PadChest that see an improvement in AUC when the \textbf{extended dataset} is used with the DenseNet-PA model. The improvement is considered relative to the AUC achieved by the single view DenseNet-PA on the main dataset, which is denoted by the darker bar. The numbers in brackets before and after the plus sign refer to the number of samples with the label in the main and extended datasets respectively.}
    \label{fig:betterext}
\end{figure}

\begin{table}[]
\centering
 \caption{\small Increase in AUC relative to DenseNet-PA (Main) for labels that improved with both DenseNet-PA (Extended) and AuxLoss-CL (Main), averaged over all runs.}
\begin{tabular}{lcc}
\toprule
\multirow{2}{*}{\textbf{Labels}} & \multicolumn{2}{c}{\textbf{AUC increase}} \\
                                 & DenseNet-PA (Extended) &  AuxLoss-CL (Main)  \\ \midrule
azygos lobe                      &         0.241          &      0.076       \\
diaphragmatic eventration        &         0.063          &      0.052       \\
hernia                           &         0.059          &      0.043       \\
air fluid level                  &         0.051          &      0.021       \\
hemidiaphragm elevation          &         0.040          &      0.036       \\
catheter                         &         0.040          &      0.056       \\
fibrotic band                    &         0.037          &      0.020       \\
mass                             &         0.032          &      0.023       \\
mediastinal enlargement          &         0.031          &      0.014       \\
volume loss                      &         0.028          &      0.024       \\
costophrenic angle blunting      &         0.026          &      0.035       \\
aortic elongation                &         0.026          &      0.013       \\
pulmonary fibrosis               &         0.025          &      0.042       \\
artificial heart valve           &         0.024          &      0.034       \\
aortic atheromatosis             &         0.023          &      0.014       \\
emphysema                        &         0.018          &      0.021       \\
surgery                          &         0.015          &      0.029       \\
pneumonia                        &         0.014          &      0.016       \\
pleural effusion                 &         0.010          &      0.024       \\
COPD signs                       &         0.009          &      0.019       \\
cardiomegaly                     &         0.008          &      0.004       \\
electrical device                &         0.004          &      0.004       \\ \bottomrule
\end{tabular}

\end{table}

\pagebreak

\section{Hyperparameters}\label{hypersec}

\subsection{Stability of training}


For each model, we performed a random search using Orion~\citep{orion} over the learning rates, the amount of dropout, and, for CL models, the probability of dropping a view for each batch. We tried 40 combinations of those hyperparameters per model. 

In Figure~\ref{fig:hyperopt}, we show the distributions of the AUC on the validation set over the combinations, per model. We observe strong variations in the shape of those distributions. HeMIS, for example, is very sensitive to the choice of hyperparameters, whereas AuxLoss is much more concentrated. 


\subsection{Training details}
For all models, we use the Adam optimizer and a batch size of 8 and train up to 40 epochs. We also compute class weights (clamped at 5) to balance the loss as the labels are highly imbalanced. All models use early stopping based on the AUC achieved on the validation set. 

\paragraph{Learning rate} The initial learning rate differs for every model but it is decayed by a factor of 10 halfway through training for all. These model-specific hyperparameters were found through an extensive random search and are given in \tableref{tab:hyperparams}. 

\paragraph{Dropout} We use dropout with a different probability for each model, given in \tableref{tab:hyperparams}.

\paragraph{Dataset} We utilize a 60-20-20 split in the dataset for train, validation and test sets. We also use data augmentation such as adding random translations and rotations. We also add random Gaussian noise to the image. 

\paragraph{Loss weights for AuxLoss} AuxLoss uses a weighted sum of the three losses (averaged over all labels) with weights of 1.0, 0.3 and 0.3 for the joint, PA and L losses. 
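Written out, with $\mathcal{L}_{\mathrm{joint}}$, $\mathcal{L}_{\mathrm{PA}}$ and $\mathcal{L}_{\mathrm{L}}$ denoting the three losses (the symbols are ours), the training objective is
\begin{equation*}
\mathcal{L} = 1.0\,\mathcal{L}_{\mathrm{joint}} + 0.3\,\mathcal{L}_{\mathrm{PA}} + 0.3\,\mathcal{L}_{\mathrm{L}}.
\end{equation*}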

\paragraph{Curriculum learning} For HeMIS, we randomly drop one of the views with probability 0.25 each. For AuxLoss, we randomly select either the PA or the L \textbf{loss} with a probability of 0.2 each and use it to update only that view's branch instead of the entire model. The weighted sum of the three losses is used to update the entire model the remaining 60\% of the time. This is done to make the model rely less on having both views available all the time. 
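As a sketch of the AuxLoss schedule described above, writing $\mathcal{L}_{\mathrm{joint}}$, $\mathcal{L}_{\mathrm{PA}}$ and $\mathcal{L}_{\mathrm{L}}$ for the joint, PA and L losses (notation ours), each update step uses
\begin{equation*}
\text{loss used} =
\begin{cases}
\mathcal{L}_{\mathrm{PA}} \text{ (updates the PA branch only)} & \text{with probability } 0.2, \\
\mathcal{L}_{\mathrm{L}} \text{ (updates the L branch only)} & \text{with probability } 0.2, \\
1.0\,\mathcal{L}_{\mathrm{joint}} + 0.3\,\mathcal{L}_{\mathrm{PA}} + 0.3\,\mathcal{L}_{\mathrm{L}} \text{ (updates the entire model)} & \text{with probability } 0.6.
\end{cases}
\end{equation*}
The two single-view cases together account for a probability of $0.2 + 0.2 = 0.4$, which matches the view dropping value listed for AuxLoss-CL in \tableref{tab:hyperparams}. Likewise, $0.25 + 0.25 = 0.5$ matches the value listed for HeMIS-CL.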

\begin{table}[htbp]
    \centering
        \caption{Hyperparameters of the best models found. Most joint models are composed of 3 parts, each with a different learning rate: the PA branch, the L branch, and a common branch. Curriculum learning (CL) models have an additional hyperparameter which is the probability of dropping one view for any given sample.}
    \begin{tabular}{l@{\hskip 0.7in}ccc}
    	\toprule
    	\textbf{Model}   & LR                                    & Dropout & View dropping \\ 
    	\midrule
    	DenseNet-PA      & $5.8 \times 10^{-4}$                                             & 0.0     &                           \\
    	DenseNet-L       & $2.6 \times 10^{-4}$                                             & 0.2     &                           \\
    	Stacked          & $1.9 \times 10^{-4}$                                             & 0.1     &                           \\
    	DualNet          & $3.0 \times 10^{-4}$, $7.6 \times 10^{-4}$, $2.7 \times 10^{-4}$ & 0.2     &                           \\
    	HeMIS            & $3.8 \times 10^{-4}$, $2.0 \times 10^{-5}$, $2.8 \times 10^{-5}$ & 0.1     &                           \\
    	HeMIS-CL         & $1.7 \times 10^{-4}$, $5.6 \times 10^{-4}$, $7.2 \times 10^{-5}$ & 0.1     & 0.5                       \\
    	AuxLoss          & $2.1 \times 10^{-4}$, $1.9 \times 10^{-4}$, $6.6 \times 10^{-4}$ & 0.2     &                           \\ 
    	AuxLoss-CL       & $6.9 \times 10^{-5}$, $9.5 \times 10^{-5}$, $5.2 \times 10^{-5}$ & 0.1     & 0.4                       \\ 
    	\bottomrule
    \end{tabular}

    \label{tab:hyperparams}
\end{table}

\end{document}

