\documentclass{midl} % Include author names
%\documentclass[anon,table]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{natbib}
\usepackage{amsmath}
\usepackage{booktabs}
%\usepackage{todonotes}
%\setlength{\marginparwidth}{2cm}
\usepackage{multirow}

%\usepackage{caption}
%\captionsetup{aboveskip=0pt,belowskip=0pt}
%\usepackage{wrapfig}

\usepackage{setspace}
\usepackage{enumitem}
\usepackage{listings}
\usepackage{titlesec}
\usepackage{xcolor}

\jmlryear{2021}
\jmlrworkshop{Full Paper -- MIDL 2021}
%\editors{Under Review for MIDL 2021}


\hypersetup{
    colorlinks,
    citecolor=gray,
    linkcolor=black,
    urlcolor=black
}

\title[Gifsplanation via Latent Shift]{Gifsplanation via Latent Shift: \\A Simple Autoencoder Approach to \\Counterfactual Generation for Chest X-rays}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

%  Two authors with the same address
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{%
\Name{Joseph Paul Cohen\nametag{$^{1,2,3}$}} \Email{joseph@josephpcohen.com}\\
\Name{Rupert Brooks\nametag{$^{4}$}} \Email{rupert.brooks@nuance.com}\\
\Name{Sovann En\nametag{$^{4}$}} \Email{sovann.en@nuance.com}\\
\Name{Evan Zucker\nametag{$^{1,2}$}} \Email{ezucker@stanford.edu}\\
\Name{Anuj Pareek\nametag{$^{1,2}$}} \Email{anujpare@stanford.edu}\\
\Name{Matthew Lungren\midljointauthortext{Contributed equally}\nametag{$^{1,2}$}} \Email{mlungren@stanford.edu}\\
\Name{Akshay Chaudhari\midlotherjointauthor\nametag{$^{1,2}$}} \Email{akshaysc@stanford.edu}\\
\addr $^{1}$ Stanford University Center for Artificial Intelligence in Medicine \& Imaging  \\
\addr $^{2}$ Stanford University Department of Radiology \\
\addr $^{3}$ Mila, Quebec AI Institute \\
\addr $^{4}$ Nuance Communications \\
}

\begin{document}

\maketitle

\begin{abstract}
\noindent \textbf{Motivation:}
Traditional image attribution methods struggle to satisfactorily explain predictions of neural networks. Prediction explanation is important, especially in medical imaging, for avoiding the unintended consequences of deploying AI systems when false positive predictions can impact patient care. Thus, there is a pressing need to develop improved methods for model explainability and introspection.

\noindent \textbf{Specific problem:}
A new approach is to transform input images to increase or decrease features which cause the prediction. However, current approaches are difficult to implement as they are monolithic or rely on GANs. These hurdles prevent wide adoption.

\noindent \textbf{Our approach:}
Given an arbitrary classifier, we propose a simple autoencoder and gradient update (Latent Shift) that can transform the latent representation of a specific input image to exaggerate or curtail the features used for prediction. We use this method to study chest X-ray classifiers and evaluate their performance. We conduct a reader study with two radiologists assessing 240 chest X-ray predictions to identify which ones are false positives (half are) using traditional attribution maps or our proposed method.

\noindent \textbf{Results:}
We found low overlap with ground truth pathology masks for models with reasonably high accuracy. However, the results from our reader study indicate that these models are generally looking at the correct features.
We also found that the Latent Shift explanation allows a user to have more confidence in true positive predictions compared to traditional approaches (0.15$\pm$0.95 on a 5-point scale with $p$=0.01) with only a small increase in false positive predictions (0.04$\pm$1.06 with $p$=0.57).


% \noindent \textbf{Conclusion:}
% We find that this simple approach works well yet raises challenges on how to systematically evaluate what models have learned. 
% However (TODO) we also find that this approach improves clinicians abilities to detect incorrect predictions.

 
\end{abstract}

% \begin{keywords}
% Attribution Maps, Interpretable Deep Learning, Chest X-ray
% \end{keywords}

% TLDR: Explaining neural network predictions by transforming input images to exaggerate or curtail the features used for prediction. Then studies the impact on Chest X-ray interpretation by radiologists.

\section{Introduction}
\label{intro}

It is important to understand why a neural network model makes a prediction, both to ensure that it uses the features we expect and to discover unknown features it may be using. Typically, 2D attribution maps based on a first-order approximation of the neural network are used \citep{Simonyan2014}, but these have limitations: they may merely represent edges \citep{Adebayo2018SanityChecks} or simply fail to indicate the features that are really being used \citep{Viviano2020, Arun2020MedicalSaliency,Arun2020CXRSanityCheck}.

Recently, the idea of visualizing predictions by exaggerating the features that change a model's predictions has been discussed by \citet{Singla2020Exaggeration,Singla2021BlackBoxSmoothly}. This exaggeration is the result of a neural network's ability to hallucinate features \citep{cohen2018distribution,Baumgartner2018VisualFeatureAttribution}, which is known to be controllable \citep{mirza2014conditional, Wang2020LatentStyle}. Instead of simply generating images of a specific class, these exaggeration methods can explain the specific features used by a classifier to make each prediction. This is valuable for detecting when a model predicts using spurious correlates, and thus for ensuring it is right for the right reasons \citep{Ross2017rrr, Zech2018}. While most image pathology prediction models have expected causal relationships where specific image regions explicitly lead to the classification label (Enlarged heart $\rightarrow$ Cardiomegaly), models predicting future risk (e.g., 5-year mortality) do not have such a known causal relationship. In these scenarios, we can learn which features are being used by applying these methods and viewing the counterfactual images.

However, existing approaches to this task have two major downsides which limit their adoption. 1) They are based on GANs \citep{Goodfellow2014GANs}, which can be very difficult and time-consuming to train because of loss function instability and hyperparameter sensitivity. 2) They are monolithic models that require the generative and discriminative components to be trained together, which prevents working with existing pretrained models.

One would prefer an approach which is modular, as simple as possible to implement, and able to work with any existing classifier as a drop-in replacement for gradient-based attribution maps.
%If this were the case these approaches could be used to explain the predictions of existing classifiers as a drop in replacement for gradient based attribution maps. 

Our approach requires a latent variable model, such as a simple autoencoder $D(E(x))$ where $E$ is the encoder and $D$ is the decoder, and a classifier $f$ which predicts a target $y$ as follows: $y = f(x)$. The latent variable model and the classifier are trained independently without any special considerations except for being differentiable. We specifically use an autoencoder because it is simple to implement and train and we believe this will increase adoption of this method. 

Once these models are trained, an explanation can be computed as follows. An input image $x$ is encoded using $E(x)$, producing a latent representation $z$. A perturbation of the latent representation is computed for a classifier $f$ as in Eq.~\ref{eq:shift}, and is then used to produce the $\lambda$-shifted samples in Eq.~\ref{eq:gen}.

\noindent\begin{minipage}{.5\linewidth}
\begin{equation}
z_{\lambda} = z + \lambda\frac{\partial f(D(z))}{\partial z}
\label{eq:shift}
\end{equation}
\end{minipage}%
\begin{minipage}{.5\linewidth}
\begin{equation}
x'_{\lambda} = D(z_{\lambda})
\label{eq:gen}
\end{equation}
\end{minipage}


For a small positive $\lambda$, the image $x'_{\lambda}$ is expected to produce a higher prediction such that $f(x'_{\lambda}) > f(x)$, while a negative $\lambda$ is expected to produce a lower one. From here we can generate multiple $x'_{\lambda}$ images to exaggerate or remove the features which result in a prediction (explored in \S \ref{sec:lambda}). These images can be stitched together into short videos (gifs) that help to explain why a prediction was made and what representation the classifier has of the concept.
Examples are available online\footnote{\url{https://mlmed.org/gifsplanation/}}.
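
As a brief sketch of why a positive $\lambda$ is expected to raise the prediction, write $g = \frac{\partial f(D(z))}{\partial z}$ (a shorthand introduced here for illustration) and assume $f \circ D$ is differentiable at $z$. A first-order Taylor expansion of Eq.~\ref{eq:shift} and Eq.~\ref{eq:gen} then gives
\begin{align*}
f(x'_{\lambda}) = f(D(z + \lambda g)) &= f(D(z)) + \lambda\, g^{\top} g + O(\lambda^2) \\
&= f(D(z)) + \lambda \lVert g \rVert_2^2 + O(\lambda^2),
\end{align*}
so for sufficiently small $|\lambda|$ the prediction rises when $\lambda > 0$ and falls when $\lambda < 0$, relative to the prediction $f(D(z))$ on the unshifted reconstruction. By the chain rule, $g = J_D(z)^{\top}\, \nabla_{x} f(x)\big|_{x = D(z)}$, where $J_D(z)$ denotes the Jacobian of the decoder, so the shift direction is the input gradient pulled back through the decoder. For larger $|\lambda|$ the higher-order terms are no longer negligible and this first-order argument no longer applies.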


\begin{figure*}[t]
    \centering
    \includegraphics[width=0.99\textwidth]{img/gif-overview-fig.pdf}
    \caption{\small A) Overview of the Latent Shift method. The encoder and decoder parts of the autoencoder are shown in gray. The classifier $f$ predicts if the CXR has a `Mass'. The image is input from the left and multiple different versions are reconstructed. B) Optical flow computed on a sequence of generated images to visualize what is changing. C) The prediction of `Mass' changes as the $\lambda$ value changes the reconstructed image. }
    \label{fig:overview}
\end{figure*}

An overview of this method is shown in Figure \ref{fig:overview}. It is important to keep in mind that this method is limited by the latent representation of the autoencoder. If the decoder is not expressive enough then it will not be able to correctly represent the features used by the classifier. Fortunately, this approach allows multiple classifiers to be compared using a fixed autoencoder (or another choice of latent variable model), which gives a clear understanding of how the representations differ between the models.

In essence we want the exact opposite of an adversarial attack. If we were just modifying the image using the gradient $\frac{\partial f(x)}{\partial x}$, which is a traditional adversarial attack, the modification would be imperceptible and would distort the image by selecting spurious pixels which happen to have an impact on the target variable. Our approach regularizes this process using a fixed decoder to keep the image on the data manifold and prevent these spurious pixels from changing. Overall, we seek to modify only the most semantically meaningful pixels that lead to a particular classification output. The contributions of our work follow:

\begin{enumerate}[topsep=3pt,itemsep=2pt]
    \item Propose a simple and elegant approach to counterfactual generation as well as a way to calculate a replacement for a traditional 2D attribution map.
    \item Explore the attribution of chest X-ray predictions using this method compared to traditional methods in terms of IoU overlap with expert masks and cascading randomization analysis.
    \item Study how this method impacts a radiologist's ability to interpret the prediction of a model compared to traditional attribution methods when presented with false positive predictions.

\end{enumerate}



\section{Related Work}
The idea of decoupling models has been raised before, and prior approaches are similar in spirit to ours in how they walk around the latent space, although they have different formulations and utilize GANs. \citet{Schutte2020StyleGanMed} learned a small function to map the latent variable to a predicted target and used it to transform the latent variable.
\citet{Joshi2018xGEMs} move in the latent space based on the classifier's loss function in order to change the class of the image. They recursively modify the latent variable until the class changes.


% . Both approaches navigate the latent space of a GAN by either learning a small function or using the classifier's loss function to flip the class of a sample. 


% Their approach used a GAN as a latent variable model and then learned a small function to map the latent variable to a predicted target value and use it to transform the latent variable. Our approach is similar in spirit but is achieved without learning a small function. 

% The work of \citet{Joshi2018xGEMs} walks along the latent space of a detached latent variable model in order to find examples of an image with the class changed. Their approach uses a GAN and their formulation is 


% which is explored in \S \ref{sec:lambda}

% With this formulation 
% Notes
% Notion of realism?
% Notion of distance?
% why not constrain pixels?


%\todo{read https://arxiv.org/abs/2004.01610}





%Related work:

%\citep{Sundararajan2016Counterfactuals}


%\citep{Ancona2017UnderstandGradientAttrib}


%\citep{Wang2020LatentStyle}

%LIME \citep{Ribeiro2016LIME}

%\citep{Arun2020MedicalSaliency}


\section{Protocol and Materials}



\subsection{Chest X-ray classifiers}

Three DenseNet121-based classifiers from existing publications were used. This specific architecture is not required, but few chest X-ray models are publicly available.
Two models, referred to as XRV-all and XRV-mimic\_ch, are from \citet{cohen2020limits}. The XRV-all model is jointly trained on 7 CXR datasets (NIH, PC, CheX, MIMIC-CXR, Google, OpenI, RSNA which are described in Appendix \S \ref{sec:datasets}). The XRV-mimic\_ch model is trained on only MIMIC-CXR \citep{Johnson2019mimic-cxr}. The other model is from the JF Healthcare group \citep{ye2020weaklychexpert}; it was built for the CheXpert challenge \citep{Irvin2019CheXpert} and at one point was ranked 1st on the leaderboard.

\subsection{Generating 2D attribution maps}

There are a few ways to generate a 2D Latent Shift attribution map which would be comparable to a typical attribution map. Here we will discuss the latentshift-max method which was found to work best. This method takes a sequence of $x'_\lambda$ images between a specific $\lambda$ range (discussed in \S \ref{sec:lambda}). First the absolute difference between the non-shifted reconstruction $x'$ and each of the shifted $x'_\lambda$ images is computed. Then the maximum difference at a per pixel level is computed to produce the final attribution map. Intuitively, this captures the maximum change as the result of the shift. More options for this conversion are discussed in Appendix \S \ref{sec:2d3d}. 
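
In symbols, the latentshift-max map described above can be sketched as follows, using notation introduced here for illustration: let $\Lambda$ be the set of sampled $\lambda$ values, $x' = D(z)$ the non-shifted reconstruction, and $(i,j)$ a pixel index. The attribution map $A$ is then
\[
A_{ij} = \max_{\lambda \in \Lambda} \left| (x'_{\lambda})_{ij} - (x')_{ij} \right| .
\]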

\subsection{Baseline attribution methods}

The baseline method of \textit{input gradients} (referred to as \textit{grad}) computes, for all images of the positive class, the absolute gradient of the prediction with respect to the input, $|\frac{\partial \hat{y_1}}{\partial \mathbf{x}}|$ \citep{Simonyan2014}. The method \textit{Guided Backprop} \citep{Springenberg2014GuidedBackprop} (referred to as \textit{guided}) tries to ignore gradients that cancel each other out by only backpropagating positive gradients. The method \textit{Integrated Gradients} \citep{Sundararajan2017integrated} (referred to as \textit{integrated}) works by integrating gradients between the input image $x_i$ and an all-zero baseline image.
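
For reference, the three baselines can be written as follows. The notation is introduced here for illustration: $f$ is the classifier output for the positive class, $x$ is the input image with pixels indexed by $j$, $\bar{x}$ is the baseline image (all zeros here), $a^{l}$ is the forward activation entering a ReLU at layer $l$, and $R^{l}$ is the signal backpropagated through it. The integrated gradients and guided backprop expressions are the standard definitions from the cited papers rather than details specific to this work.
\begin{align*}
\text{grad:} \quad & A_j = \left| \frac{\partial f(x)}{\partial x_j} \right| \\
\text{integrated:} \quad & A_j = (x_j - \bar{x}_j) \int_{0}^{1} \frac{\partial f(\bar{x} + \alpha (x - \bar{x}))}{\partial x_j} \, d\alpha \\
\text{guided:} \quad & R^{l} = \mathbf{1}[a^{l} > 0] \cdot \mathbf{1}[R^{l+1} > 0] \cdot R^{l+1}
\end{align*}
In practice the integral is approximated by a finite sum over interpolation steps.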

\subsection{Mask annotation datasets and IoU calculation}
\label{sec:ioucalc}

Expert mask annotations were used to evaluate attribution maps. Bounding boxes from the NIH dataset \citep{WangNIH2017} were used for Atelectasis, Cardiomegaly, Effusion, and Mass. Segmentation masks from the RSNA Pneumonia Challenge \citep{Shih2019RSNAKaggle} were used for Lung Opacity. Segmentation masks from the SIIM-ACR Pneumothorax Challenge \citep{Filice2020SIIMPneumothorax} were used for Pneumothorax. Additional details are given in Appendix \S \ref{tab:mask_counts}.

To fairly compute an IoU value (intersection over union; $\text{IoU}(\text{mask},\text{img}) = \frac{|\text{mask} \cap \text{img}|}{|\text{mask} \cup \text{img}|}$) for the 2D attribution methods we followed \citet{Viviano2020}: a binarized attribution map is created such that the top $p$ percentile pixels are set to 1, where $p$ is dynamically set to match the number of pixels in the ground truth mask that it is being compared to.
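
As a small worked example of this binarization (the numbers are illustrative and not taken from the experiments): let $M$ be the set of mask pixels with $|M| = k$, and let $B$ be the $k$ highest-valued pixels of the attribution map, so that $|B| = |M| = k$. If $o = |M \cap B|$ pixels overlap, then
\[
\text{IoU} = \frac{|M \cap B|}{|M \cup B|} = \frac{o}{2k - o} .
\]
For instance, $k = 100$ and $o = 30$ give $\text{IoU} = 30/170 \approx 0.18$. Because both sets have the same size, the score reaches 1 only when the top-$k$ pixels coincide exactly with the mask.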

\section{Experiments}

All source code\footnote{\url{https://github.com/mlmed/gifsplanation}} and datasets (see \S \ref{sec:datasets}) are publicly available. The classifiers, autoencoder and their respective pre-trained weights as used in this work are available in TorchXRayVision 0.0.24 \citep{Cohen2020xrv}. PyTorch 1.6.0 \citep{paszke2017pytorch} and Captum 0.3.0 \citep{kokhlikyan2020captum} were used for model training and feature attribution, respectively. 

\subsection{Autoencoder architecture and training}

Keeping with our goal to build the most straightforward model, a ResNet \citep{he2016resnet} convolutional autoencoder was used as it is able to achieve high fidelity image reconstruction and is relatively easy to implement. An elastic (squared + absolute) loss was used to capture both large and small features.
This model was trained on 4 large datasets: NIH, PC, RSNA, and MIMIC. %(See Appendix \S \ref{sec:datasets}).
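
One way to write such an elastic loss is sketched below. The equal weighting of the two terms and the sums over pixels $j$ are assumptions made here for illustration; the exact weighting used in training is not stated in this section.
\[
\mathcal{L}(x) = \sum_{j} \left( x_j - D(E(x))_j \right)^2 + \sum_{j} \left| x_j - D(E(x))_j \right|
\]
The squared term penalizes large errors heavily, while the absolute term keeps a gradient of constant magnitude for small errors.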

The bottleneck of the autoencoder is a major variable in the quality of the explanations. In Figure \ref{fig:bottleneck} the bottleneck size is varied and latentshift-max images are computed using the XRV-all model to predict Cardiomegaly (an enlarged heart). Looking qualitatively at the generated image explanations and their corresponding videos, we observe that a large bottleneck results in spotty changes in the region of interest that do not appear to clearly vary the pathology. At smaller bottleneck sizes the size of the heart appears to be controlled. However, if the bottleneck is too small then small features, such as the ribs, are lost. In further experiments a ResNet101 with a bottleneck size of 4608 is used.

Unexpectedly, we find that larger bottleneck sizes have a higher IoU but do not result in a better explanation when viewed qualitatively. The shifted images do not appear to transition smoothly from one to the next, and the changes appear unrelated to the pathology. This calls into question how well the IoU analysis captures the quality of these approaches.
During training we find that as the validation MAE decreases later in training, the IoU also goes down. This indicates that the reconstruction error is a useful proxy for explanation quality only early in training. Likely, towards the end of training, fitting the small details hurts the ability to control major features of the images. See Appendix \S \ref{sec:aedesign} for more plots.


\begin{figure}[t]
    \centering
    \includegraphics[width=1.0\textwidth]{img/ae-plots2.png}
    \caption{\small The latentshift-max method to generate 2D attribution maps is applied to the same image using autoencoders which vary in bottleneck size. The top 95\% of explanation pixels are shown.}
    \label{fig:bottleneck}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[width=1.0\textwidth]{img/latent-range.png}
    \caption{\small Example of the pathology prediction as $\lambda$ is moved along the latent shift axis for 3 different classification models on the same image. The same autoencoder is used in all cases. At $\lambda$=0 the prediction is the classifier's output on the unmodified reconstructed image.}
    \label{fig:shift-plot}
\end{figure}

\subsection{Determining the \texorpdfstring{$\lambda$}{lambda} range}
\label{sec:lambda}

When making changes to the latent representation it is important to control the extent of the change. Too little and the difference between the images won't be significant enough to change the prediction of the model. Too large and the image will become too distorted and won't represent the pathology. 

In Figure \ref{fig:shift-plot} the latent representation is varied by different $\lambda$ values for three different models on four different tasks. Here the direction of the change in the latent space is defined by the gradient computed for each model. We observe there is variation between how the prediction changes for each model. The smoothness here is a sign that the representation is good. Surprisingly the dynamic range of the predictions between these tasks is similar. We observed that this range is decoder specific and different decoders will have much larger or smaller dynamic ranges.
When creating sequences of images we utilize a simple iterative search algorithm to determine the lower and upper $\lambda$ values (see Appendix \S \ref{sec:lambdasearch}). The lambdas are chosen such that the prediction decreases by 30\% and increases by 5\%. We find the pathologies seem most clear when the image sequence removes the pathology in contrast to prior work which exaggerates it. 
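
The stopping conditions of this search can be sketched as follows. Here $g$ is shorthand for the gradient in Eq.~\ref{eq:shift} and $p(\lambda) = f(D(z + \lambda g))$; both symbols are introduced for illustration. Reading the percentages as absolute changes in the predicted probability is an assumption, and the actual procedure is the one given in the appendix.
\[
\lambda_{\min} = \max\{\lambda < 0 : p(\lambda) \le p(0) - 0.30\}, \qquad
\lambda_{\max} = \min\{\lambda > 0 : p(\lambda) \ge p(0) + 0.05\}
\]
Under this reading, an iterative search steps $\lambda$ away from 0 in each direction until the corresponding condition is first met (when such a value exists), and the image sequence is generated for $\lambda$ values between $\lambda_{\min}$ and $\lambda_{\max}$.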




\begin{figure*}[t]
    \centering
    \includegraphics[width=0.95\textwidth]{img/attribution-maps.jpg}
    \caption{\small The XRV-all and jfhealthcare models make positive predictions on images for Cardiomegaly. These predictions are explained using multiple 2D attribution maps. An expert bounding box is shown for Cardiomegaly in yellow. No Gaussian blur is applied to these attribution maps.}
    \label{fig:attributionmaps}
\end{figure*}

\subsection{Qualitative 2D attribution map comparison}

In Figure \ref{fig:attributionmaps} qualitative results are shown when varying the model and pathology across multiple attribution methods. One very notable difference is that our method produces a smoother attribution map without blurring. The gradient-based approaches have a speckled pattern which is typically alleviated using Gaussian blur. Between the two models evaluated we can see that similar regions are highlighted but they also have distinct differences. This variability is a powerful aspect of this method because we can study the different features used between models. Here it appears that the JF Healthcare model mostly looks at the right side (chest right = image left) of the heart while the XRV-all model looks at both sides. This is also confirmed by looking at the generated videos. 2D images present only a small part of the information that this method provides. Videos and images can be seen side by side at this URL\footnote{\url{https://mlmed.org/gifsplanation/}}.

\subsection{Quantitative IoU comparison}
\label{sec:quantiou}
The different 2D attribution maps are compared based on their IoU in Table \ref{tab:iou}. This experiment confirms that our method produces attributions similar to those of other methods. While two models achieve reasonable AUC scores for Pneumothorax, their IoU scores are extremely low, which indicates that either the pathology is predicted using spurious features, the bounding boxes are wrong, or the model is predicting using some confounding pathology. The overall low scores yet high AUC call into question the validity of using bounding box or mask information to evaluate attribution methods.


\begin{table*}[t]
%\small
\definecolor{c1}{HTML}{f6edcf}
\definecolor{c2}{HTML}{f0dab1}
\definecolor{c3}{HTML}{daf1f9}
\caption{\small The IoU and AUC are evaluated for 4 attribution methods over 3 models. For each task the IoU was calculated as the mean over 80 samples. The AUC was calculated as the mean over 2048 samples from the same dataset. Note that we compute the best case IoU (see \S \ref{sec:ioucalc}).}
\label{tab:iou}
\centering
\resizebox{0.97\columnwidth}{!}
{%
\begin{tabular}{lcclcccccc}
\toprule 
 & & &  Model $\rightarrow$ & \multicolumn{2}{c}{ XRV-all} & \multicolumn{2}{c}{XRV-mimic\_ch} & \multicolumn{2}{c}{JF Healthcare} \\
Task & Dataset & Example & 2D Method & AUC & IoU & AUC & IoU & AUC & IoU \\
\midrule
\multirow{4}{*}{Atelectasis} & 
\multirow{4}{*}{NIH} & 
\multirow{4}{*}{\includegraphics[width=53pt]{img/mask-Atelectasis-NIH-310.png}} & 
grad & & 0.07$\pm$0.07 &  & 0.06$\pm$0.07 & & \textbf{0.13$\pm$0.10} \\
& & & guided &  & 0.09$\pm$0.08 &  & 0.04$\pm$0.04 &  & 0.10$\pm$0.07 \\
& & & integrated &  & 0.05$\pm$0.05 &  & 0.04$\pm$0.05 &  & 0.10$\pm$0.09 \\
& & & latentshift-max & \multirow{-4}{*}{0.78}  & \textbf{0.11$\pm$0.12} & \multirow{-4}{*}{0.70}  & \textbf{0.08$\pm$0.11} & \multirow{-4}{*}{0.77}  & 0.09$\pm$0.09 \\
\midrule
\multirow{4}{*}{Cardiomegaly} & 
\multirow{4}{*}{NIH} & 
\multirow{4}{*}{\includegraphics[width=53pt]{img/mask-Cardiomegaly-NIH-865.png}} & 
grad & & \textbf{0.35$\pm$0.05} & & \textbf{0.25$\pm$0.09} & & \textbf{0.45$\pm$0.04} \\
& & & guided &  & 0.28$\pm$0.06 &  & 0.15$\pm$0.06 &  & 0.31$\pm$0.05 \\
& & & integrated &  & 0.27$\pm$0.08 &  & 0.15$\pm$0.08 &  & 0.36$\pm$0.09 \\
& & & latentshift-max & \multirow{-4}{*}{0.90} & 0.33$\pm$0.07 & \multirow{-4}{*}{0.69}  & 0.21$\pm$0.09 & \multirow{-4}{*}{0.90} & 0.35$\pm$0.09 \\
\midrule
\multirow{4}{*}{Effusion} & 
\multirow{4}{*}{NIH}  & 
\multirow{4}{*}{\includegraphics[width=53pt]{img/mask-Effusion-NIH-9157.png}} & 
grad &  & 0.12$\pm$0.09 & & 0.08$\pm$0.08 & & \textbf{0.18$\pm$0.10} \\
& & & guided &  & 0.15$\pm$0.09 &  & 0.06$\pm$0.05 &  & 0.14$\pm$0.07 \\
& & & integrated &  & 0.11$\pm$0.08 &  & 0.05$\pm$0.06 &  & 0.14$\pm$0.09 \\
& & & latentshift-max & \multirow{-4}{*}{0.87} & \textbf{0.16$\pm$0.11} & \multirow{-4}{*}{0.80}  & \textbf{0.11$\pm$0.11} & \multirow{-4}{*}{0.87}  & 0.16$\pm$0.10 \\
\midrule
\multirow{4}{*}{Mass} & 
\multirow{4}{*}{NIH}  & 
\multirow{4}{*}{\includegraphics[width=53pt]{img/mask-Mass-NIH-3160.png}} & 
grad &  & 0.16$\pm$0.14 &  &  & &  \\
& & & guided &  & \textbf{0.19$\pm$0.16} & \multicolumn{2}{c}{Model does}   & \multicolumn{2}{c}{Model does}   \\
& & & integrated &  & 0.13$\pm$0.13 & \multicolumn{2}{c}{not predict}   & \multicolumn{2}{c}{not predict}   \\
& & & latentshift-max & \multirow{-4}{*}{0.82} & 0.14$\pm$0.17 &  &  &  &  \\
\midrule
\multirow{4}{*}{Lung Opacity} & 
\multirow{4}{*}{RSNA}  & 
\multirow{4}{*}{\includegraphics[width=53pt]{img/mask-Lung-Opacity-RSNA-4.png}} & 
grad &  & \textbf{0.21$\pm$0.11} &  & 0.13$\pm$0.09 & &  \\
& & & guided &  & 0.21$\pm$0.12 &  & 0.09$\pm$0.07 & \multicolumn{2}{c}{Model does}   \\
& & & integrated &  & 0.17$\pm$0.10 &  & 0.08$\pm$0.07 & \multicolumn{2}{c}{not predict}  \\
& & & latentshift-max & \multirow{-4}{*}{0.84} & 0.20$\pm$0.13 & \multirow{-4}{*}{0.75} & \textbf{0.15$\pm$0.14} &  &  \\
\midrule
\multirow{4}{*}{Pneumothorax} & 
\multirow{4}{*}{SIIM-ACR}  & 
\multirow{4}{*}{\includegraphics[width=53pt]{img/mask-Pneumothorax-SIIM-10.png}} & 
grad &  & 0.01$\pm$0.02 &  & 0.01$\pm$0.02 &  &  \\
& & & guided &  & \textbf{0.03$\pm$0.05} &  & 0.02$\pm$0.03 & \multicolumn{2}{c}{Model does}  \\
& & & integrated &  & 0.01$\pm$0.02 &  & 0.01$\pm$0.01 & \multicolumn{2}{c}{not predict} \\
& & & latentshift-max & \multirow{-4}{*}{0.78} & 0.02$\pm$0.04 & \multirow{-4}{*}{0.67} & \textbf{0.03$\pm$0.07} &  &  \\
\bottomrule
\end{tabular}
}
\end{table*}

\newpage



\subsection{Cascading randomization analysis}


\citet{Adebayo2018SanityChecks} 
showed that even visually convincing attribution maps could be misleading and only weakly dependent on the network parameters.  We replicate their proposed cascading randomization evaluation. Starting at the classifier end of the network, layer weights are randomized, and the attribution is reevaluated and the correlation computed between the resulting attribution and the original.  Intuitively, one expects that the attribution should rapidly become decorrelated. 
%As latentshift-max by definition takes the absolute value of saliency, we have applied absolute value to all the saliency maps.  
As shown in Figure \ref{fig:cra}, the correlation with the final attribution drops off most rapidly with latentshift-max.  Similarly to the findings in \citeauthor{Adebayo2018SanityChecks}, 
the guided backprop method produces a very similar attribution even as a significant fraction of the model is reinitialized.  The patterns for other pathologies were extremely similar and are shown along with some further details in Appendix \ref{sec:cascading_randomization_appendix}.
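
The quantity plotted can be sketched as follows, with notation introduced here for illustration: let $A^{(0)}$ be the attribution map of the original model and $A^{(k)}$ the map obtained after randomizing the last $k$ layers, both flattened over $N$ pixels. Taking the Pearson correlation is an assumption made for this sketch (a rank correlation is another common choice):
\[
\rho_k = \frac{\sum_{j=1}^{N} \left(A^{(0)}_j - \bar{A}^{(0)}\right)\left(A^{(k)}_j - \bar{A}^{(k)}\right)}{\sqrt{\sum_{j=1}^{N} \left(A^{(0)}_j - \bar{A}^{(0)}\right)^2}\,\sqrt{\sum_{j=1}^{N} \left(A^{(k)}_j - \bar{A}^{(k)}\right)^2}},
\]
where $\bar{A}$ denotes the mean over pixels. A method that depends strongly on the learned weights should show $\rho_k$ falling quickly as $k$ grows.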

\begin{figure}
\centering
    \includegraphics[width=0.6\textwidth]{img/Atelectasis_cascrand.png}
  \caption{\small Correlation between attribution generated by different methods when layers in the network are reset. }
  \label{fig:cra}
\end{figure}

\subsection{Improvement in false positive detection}

We performed a reader study to determine if our method can improve the ability to detect false positive predictions (examples in Appendix \S \ref{sec:falsepositives}) as well as if the features utilized are correct. For this study we recruited two radiologists (A.J. and E.Z., with 2 and 12 years of experience, respectively). They were presented with 240 images twice, each being predicted as having one of 6 pathologies by the XRV-all model (Atelectasis, Cardiomegaly, Effusion, Lung Opacity, Mass, Pneumothorax). Examples were selected such that 50\% were predicted incorrectly by the model (false positives). An incorrect prediction is defined as one with a negative label and a $>$50\% prediction by the model, which was calibrated such that a 50\% prediction is the operating point of the ROC curve on validation data.

Each sample is viewed using traditional attribution methods (Method A) and the Latent Shift method (Method B). For Method A the radiologist can see all traditional methods at once (Input gradients, Guided Backprop, and Integrated Gradients). For Method B the radiologist can see both the 2D latentshift-max image and a gif animation side by side. Radiologists were asked the following questions on a 5-point Likert scale: ``How confident are you in the model's prediction? (1-5)'' and ``Is the model looking at the correct feature? (1-5)''.

The primary study results are shown in Figure \ref{fig:reader-results} and more details can be found in Appendix \S \ref{sec:readerstudy}. 
Overall, for true positive predictions there is a 0.15$\pm$0.95 confidence increase using the Latent Shift method ($p$=0.01 using the Wilcoxon signed-rank test).
For false positive predictions there is a 0.04$\pm$1.06 increase which is not significant ($p$=0.57). We expected false positives to be scored lower, so these results raise concerns about overconfidence induced by model predictions, although it is possible that some of the ground truth labels were wrong.
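
For readers unfamiliar with the test, the standard Wilcoxon signed-rank statistic for paired scores can be sketched as follows. The notation is introduced here for illustration, and pairing the Method A and Method B scores per image and reader is an assumption about how the test was applied. With $d_i = s^{B}_i - s^{A}_i$ the paired difference for sample $i$, zero differences are discarded and the remaining $|d_i|$ are ranked (ties receive average ranks), giving ranks $r_i$. Then
\[
W^{+} = \sum_{i : d_i > 0} r_i, \qquad W^{-} = \sum_{i : d_i < 0} r_i, \qquad W = \min(W^{+}, W^{-}),
\]
and the $p$-value is obtained from the null distribution of $W$, under which positive and negative differences are equally likely.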

In the radiologists' feedback (verbatim in Appendix \S \ref{sec:readerfeedback}) they stated that the Latent Shift method was more intuitive and that it increased their confidence that the model is looking at the correct feature. They observed that this method looks at the boundaries of the abnormality. One radiologist believed that the model was using the chest tube to predict Pneumothorax instead of looking at the correct area (examples in Appendix \S \ref{sec:extraimages}). This observation is consistent with the IoU analysis and is likely because the model input is too low resolution ($224\times224$) to see the small features at the edge of the lung.


\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth]{img/gif-reader-results.jpg}
    \caption{\small A) Responses to the survey questions split by each pathology. B) Regression lines comparing the IoU for each true positive image with the reader responses. }
    \label{fig:reader-results}
\end{figure*}





% In each group the correlation between the perceived confidence and the prediction was correct is calculated. A high correlation would mean that the radiologists were able to detect erroneous predictions.

% Then the difference in correlation between the two groups is compared to see if there is a significant improvement.

\section{Conclusion}

We presented Latent Shift, a simple-to-implement approach for explaining the predictions of models by simulating changes to the input images that increase or decrease the prediction of a classifier. Our approach is designed to be easy to implement so that it can be adopted in other domains and work with existing pre-trained classifiers. 
%This methods holds promise to elucidate what features are used to make further scientific discoveries.

We evaluated Latent Shift and other attribution methods on how well they align with ground truth spatial mask information. We found very low IoU values for models with reasonably high AUCs, but from this alone we cannot conclude which of the two is in error. The results from our reader study indicate that higher IoU values are correlated with correct features. 
%Another idea to keep in mind here is that even though a 2D attribution map may indicate a model is looking at the right location it doesn't mean that feature used there is what we expect. 

We find that the Latent Shift explanation allows a user to have more confidence in true positive predictions compared to traditional approaches. However, we also found that detecting false positive predictions was challenging, which highlights the need for a stronger radiologist-algorithm symbiosis.


% Acknowledgements should only appear in the accepted version.
\section*{Acknowledgements}

We thank Joseph D. Viviano, Chin-Wei Huang, Lan Dao, Jin Long, Pranav Rajpurkar, William J Sehnert, and Levon Vogelsang for useful discussions.
This research is based on work partially supported by Carestream Health, the CIFAR AI and COVID-19 Catalyst Grants, and by NIH/NIBIB Grants 75N92020D00018 / 75N92020F00001. Some of the computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results.
We thank AcademicTorrents.com for making data available for our research.

\section*{Conflict of interest disclosure}

Akshay Chaudhari has provided consulting services to Skope MR, Inc., Subtle Medical, Chondrometrics GmbH, Image Analysis Group, Edge Analytics, ICM Co., and Culvert Engineering; is a shareholder of Subtle Medical, LVIS Corporation, and Brain Key; is on the advisory board for Chondrometrics GmbH and Brain Key; and receives research support from GE Healthcare and Philips not related to this work.
 
\section*{Ethics}

The study conducted in this research has been approved by the ethical review board at Stanford University.

\bibliography{cohen21}

\appendix
\counterwithin{figure}{section}
\counterwithin{table}{section}

\renewcommand\thefigure{\thesection.\arabic{figure}}   


\newpage
\section{Datasets}
\label{sec:datasets}

\subsection{Autoencoder datasets} 

The autoencoder datasets are NIH, PC, RSNA, and MIMIC.

\subsection{Classifier datasets}
\begin{itemize}
    \item XRV-all: NIH, PC, CheX,  MIMIC-CXR,  Google,  OpenI,  RSNA 
    \item XRV-mimic\_ch: MIMIC-CXR using the CheXpert labeller
    \item JF Healthcare: CheX
\end{itemize}


\begin{table}[h!]
\caption{Details of datasets used}
\label{tab:datasets}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lllll}
\toprule
ID & Name & From & Citation & Geographic Region \\
\midrule
RSNA & RSNA Pneumonia Challenge & RSNA & \citet{Shih2019RSNAKaggle} & Northeast USA \\
CheX & CheXpert & Stanford University & \citet{Irvin2019CheXpert} & Western USA  \\
NIH &  ChestX-ray8 & National Institutes of Health & \citet{WangNIH2017} & Northeast USA  \\
Google & Google Labelling of NIH data & Google & \citet{Majkowska2019} & Northeast USA  \\
MIMIC\_CH & MIMIC-CXR with CheX Labels & MIT & \citet{Johnson2019mimic-cxr} & Northeast USA  \\
PC & PadChest & University of Alicante & \citet{Bustos2019PadChest} & Spain  \\
OpenI & OpenI & National Library of Medicine & \citet{Demner-Fushman2016} & USA \\
SIIM & SIIM-ACR Pneumothorax Challenge & SIIM-ACR & \citet{Filice2020SIIMPneumothorax} & Northeast USA\\
\bottomrule
\end{tabular}
}
\end{table}


\begin{table}[h!]
    \centering
    
\caption{Counts of images in each dataset with a positive label for the pathology listed.}
\label{tab:pathology_counts}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lrrrrrr}
\toprule
Dataset &  Atelectasis &  Cardiomegaly &  Effusion &  Lung Opacity &  Mass &  Pneumothorax \\
\midrule
NIH      &         5728 &          1563 &      6589 &             0 &  3567 &          3407 \\
PC       &         3981 &          8420 &      3342 &             0 &   806 &           223 \\
RSNA     &            0 &             0 &         0 &          1348 &     0 &             0 \\
SIIM     &            0 &             0 &         0 &             0 &     0 &          3576 \\
MIMIC\_CH &        10076 &          9831 &     12064 &         13825 &     0 &          2350 \\
CheX     &         3195 &          2909 &      8078 &          9736 &     0 &          1802 \\
Google   &            0 &             0 &         0 &           221 &     0 &            46 \\
OpenI    &          271 &           185 &       120 &           327 &     6 &            14 \\
\bottomrule
\end{tabular}
}

\end{table}

\begin{table}[h]
    \caption{\small Full listing of counts for bounding boxes and masks available}
    \label{tab:mask_counts}
    \centering
    \begin{tabular}{lccc}
    \toprule
    Dataset & Task &  Mask Type & Count \\
    \midrule
    NIH & Atelectasis  &Bounding Box &            180 \\
    NIH & Effusion     &Bounding Box &            153 \\
    NIH & Cardiomegaly &Bounding Box &            146 \\
    NIH & Infiltration &Bounding Box &            123 (not used)\\
    NIH & Pneumonia    &Bounding Box &            120 (not used)\\
    NIH & Pneumothorax &Bounding Box &             98 (not used) \\
    NIH & Mass         &Bounding Box &             85 \\
    NIH & Nodule       &Bounding Box &             79 (not used)\\
    RSNA & Lung Opacity & Segmentation &        6012 \\
    SIIM-ACR & Pneumothorax & Segmentation & 3576\\
    \bottomrule
    \end{tabular}

\end{table}

\newpage
\section{3D to 2D Construction}
\label{sec:2d3d}


\begin{figure}[h!]
    \centering
    \includegraphics[width=1\textwidth]{img-extra/gif-2d-methods.png}
    \includegraphics[width=1\textwidth]{img-extra/gif-2d-methods-Effusion.png}
    \caption{\small Examples of the different methods to convert the sequence of images into a 2D image. The resulting maps are hard to tell apart even though they are generated in different ways.}
    \label{fig:2dmethods}
\end{figure}


\begin{itemize}
    \item \textbf{latentshift-mean}: Take the average of all $x_\lambda$ images.
    \item \textbf{latentshift-max}: For each spatial location, take the maximum distance between any $x_\lambda$ and the image at $\lambda=0$.
    \item \textbf{latentshift-minmax}: Take the absolute difference between the images at the lowest and highest $\lambda$:
$|x_{\lambda_{\min}} - x_{\lambda_{\max}}|$.
    \item \textbf{latentshift-sliding interval}: Compute the difference between the images at each pair of consecutive $\lambda$ steps and then average these differences.
\end{itemize}
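
As a compact restatement of the list above, suppose the shifted images are generated at steps $\lambda_1 < \dots < \lambda_n$ and let $x_0$ denote the image at $\lambda=0$. The indexing, the symbol $m$, and the use of absolute differences in the sliding-interval map are assumptions made here for illustration. Under these assumptions the four maps at spatial location $(i,j)$ can be written as
\begin{align*}
m_{\text{mean}}(i,j) &= \frac{1}{n}\sum_{k=1}^{n} x_{\lambda_k}(i,j), \\
m_{\text{max}}(i,j) &= \max_{k} \left| x_{\lambda_k}(i,j) - x_{0}(i,j) \right|, \\
m_{\text{minmax}}(i,j) &= \left| x_{\lambda_{\min}}(i,j) - x_{\lambda_{\max}}(i,j) \right|, \\
m_{\text{sliding}}(i,j) &= \frac{1}{n-1}\sum_{k=1}^{n-1} \left| x_{\lambda_{k+1}}(i,j) - x_{\lambda_k}(i,j) \right|.
\end{align*}
With this reading, if a pixel changes monotonically in $\lambda$, the sum in $m_{\text{sliding}}$ telescopes and the map equals $m_{\text{minmax}}/(n-1)$ at that pixel, so the two maps differ only by a constant scale there.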

\newpage
\section{Lambda Search}
\label{sec:lambdasearch}

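\begin{verbatim}
# classifier, compute_shift and img are assumed to be defined.
lbound = 0
initial_pred = classifier(img)
last_pred = initial_pred
while True:
    shifted = compute_shift(img, lbound)
    cur_pred = classifier(shifted)
    # Stop when the prediction no longer decreases, has dropped
    # more than 0.5 below the initial one, or the limit is hit.
    if (last_pred < cur_pred
            or initial_pred - 0.5 > cur_pred
            or lbound <= -1000):
        break
    last_pred = cur_pred
    lbound = lbound - 10
\end{verbatim}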
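
The search above can be exercised end to end with toy stand-ins. The sketch below wraps the loop in a function and assumes that \texttt{cur\_pred} is the prediction on the shifted image and \texttt{initial\_pred} the prediction on the unshifted image. The names \texttt{search\_left\_bound}, \texttt{toy\_classifier}, and \texttt{toy\_shift}, as well as the parameter names, are hypothetical; the toy functions only mimic a prediction that falls as $\lambda$ decreases and are not the classifier or autoencoder used in this work.

```python
def search_left_bound(img, classifier, compute_shift,
                      step=10, max_drop=0.5, limit=-1000):
    """Walk lambda downwards until the prediction stops decreasing,
    has dropped by more than max_drop, or the limit is reached."""
    initial_pred = classifier(img)
    last_pred = initial_pred
    lbound = 0
    while True:
        cur_pred = classifier(compute_shift(img, lbound))
        if (last_pred < cur_pred
                or initial_pred - max_drop > cur_pred
                or lbound <= limit):
            break
        last_pred = cur_pred
        lbound -= step
    return lbound


# Toy stand-ins (hypothetical): the "image" is a single number, the
# classifier clips it to [0, 1], and a shift of lambda moves the
# image by lambda / 1024.
def toy_classifier(x):
    return min(1.0, max(0.0, x))


def toy_shift(x, lam):
    return x + lam / 1024


print(search_left_bound(0.75, toy_classifier, toy_shift))
```

The same loop with the sign of the step reversed would give the corresponding right bound; the three stopping conditions keep the search from running past the point where the shift still changes the prediction in the intended direction.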

\newpage
\section{Autoencoder Parameters}
\label{sec:aedesign}

\begin{figure}[h!]
    \centering
    \includegraphics[width=1\textwidth]{img-extra/gif-bottleneck-recon-cardio.png}
    \includegraphics[width=1\textwidth]{img-extra/gif-bottleneck-recon-Effusion.png}
    \caption{\small Varying bottleneck size of the autoencoder. The top rows show the reconstruction produced by each autoencoder, and the bottom rows show the 2D attribution map constructed with the latentshift-max method, overlaid on the input image.}
    \label{fig:aebottleneckfull}
\end{figure}

\newpage

\begin{figure}[h]
    \centering
    \includegraphics[width=1.0\textwidth]{img-extra/ae-plots-detail.png}
    \caption{\small In A, B, and C each point represents an evaluation of the XRV-all model with a specific autoencoder configuration. The model is evaluated on a fixed set of 10 images which contain Cardiomegaly as indicated by NIH bounding boxes. The color of each point shows the training epoch at which the evaluation was performed, which allows a fairer comparison of networks that converge at different rates. A larger bottleneck produces a smaller MAE, but there is no strong trend for IoU. In B, ResNets of different depths are evaluated and no major trend is found, except that a ResNet151 can potentially achieve better IoUs than a ResNet101. However, its computational cost is significantly higher, which makes this model harder to train. }
    \label{fig:ae}
\end{figure}

\newpage
\section{Extra IoU Comparisons}
\label{sec:extraiou}

We also benchmark the attribution method \textit{Iterative Delete} \citep{Bordes2018Iteratively}, which did not fit in the main text. This approach iteratively removes the top first-order gradients from the image and reprocesses it. This evaluation provides a more recent baseline alongside the attribution methods used in this paper.

\begin{figure}[h!]
    \centering
    \includegraphics[width=1\textwidth]{img-extra/gif-2d-overlay-extra.png}
    \includegraphics[width=1\textwidth]{img-extra/gif-2d-overlay-extra-Effusion.png}
    \caption{\small This is an extension of Figure \ref{fig:attributionmaps}. The XRV-all and jfhealthcare models make positive predictions on images for Cardiomegaly. These predictions are explained using multiple 2D attribution maps. An expert bounding box is shown for Cardiomegaly in yellow. No Gaussian blur is applied to these attribution maps.}
    \label{fig:noise}
\end{figure}

\begin{table}[h!]
    \caption{IoU evaluation of the Iterative Delete method. The mean is taken over the same 80 samples used in Table \ref{tab:iou}.}
    \label{tab:extraiou}
    \centering
%\resizebox{\columnwidth}{!}
{%
\begin{tabular}{llccc}
\toprule
             & Model & XRV-all & XRV-mimic\_ch & JF Healthcare \\
Target & Method &                     &                          &                          \\
\midrule
Atelectasis & grad &       0.07$\pm$0.07 &            0.06$\pm$0.07 &            0.13$\pm$0.10 \\
             & iterativedelete &       0.05$\pm$0.06 &            0.04$\pm$0.05 &            0.06$\pm$0.07 \\
Cardiomegaly & grad &       0.35$\pm$0.05 &            0.25$\pm$0.09 &            0.45$\pm$0.04 \\
             & iterativedelete &       0.30$\pm$0.05 &            0.26$\pm$0.09 &            0.30$\pm$0.09 \\
Effusion & grad &       0.12$\pm$0.09 &            0.08$\pm$0.08 &            0.18$\pm$0.10 \\
             & iterativedelete &       0.08$\pm$0.07 &            0.09$\pm$0.09 &            0.10$\pm$0.07 \\
Lung Opacity & grad &       0.21$\pm$0.11 &            0.13$\pm$0.09 &                      - \\
             & iterativedelete &       0.17$\pm$0.09 &            0.13$\pm$0.10 &                      - \\
Mass & grad &       0.16$\pm$0.14 &                      - &                      - \\
             & iterativedelete &       0.13$\pm$0.12 &                      - &                      - \\
Pneumothorax & grad &       0.01$\pm$0.02 &            0.01$\pm$0.02 &                      - \\
             & iterativedelete &       0.01$\pm$0.02 &            0.01$\pm$0.03 &                      - \\
\bottomrule
\end{tabular}
}
\end{table}


\newpage
\section{Reader Study}
\label{sec:readerstudy}


\begin{figure}[h!]
    \centering
    \includegraphics[width=0.65\textwidth]{img-extra/trial-homepage.png}

    \caption{\small Screenshot of the primary interface used in the reader study, which lists all the images to be studied.}
    \label{fig:interfacemain}
\end{figure}

\begin{figure}[h!]
    \centering
    
\begin{tabular}{@{}p{0.49\textwidth} p{0.49\textwidth}@{}}
  \includegraphics[width=0.48\textwidth]{img-extra/trial-methoda.png} &
  \includegraphics[width=0.48\textwidth]{img-extra/trial-methodb.png}
\end{tabular}
    \caption{\small Screenshots of the per-image interface used in the reader study. The left shows the interface with the traditional methods and the right shows the interface with the Latent Shift method.}
    \label{fig:interfaceviewer}
\end{figure}

\subsection{Reader study feedback}
\label{sec:readerfeedback}

\paragraph{Reader 1} 
Some general observations would be that the new prediction method is more intuitive and for most pathologies increases the confidence that the model is looking at the feature a radiologist would look at to make the diagnosis (except for pneumothorax).  There were some clear examples where the model made the correct prediction but missed salient findings (e.g., cases 199, 200- predicted mass but did not detect some large masses).  Also is interesting that the model in many cases seems to look at the boundaries of an abnormality rather than the actual abnormality or everything else except the abnormality (e.g., contralateral lung) in making predictions so may be a different ``interpretation" style."

\paragraph{Reader 2}

\begin{itemize}
\item Latent Shift (B) does much better than gradients (A) approach. 

\item Within the gradients methods: Image Gradient and Guided Backprop does well, while the highlighted pixels for Integrated Gradients seem to be all over the place (i.e. not good)

\item There is a clear correlation between high output prediction probability and better highlighting of important pixels.

\item The model is really struggling with pneumothorax - both in terms of prediction and in terms of highlighting correct pixels. This goes for both method A and B. FYI, I did not "count" a resolved pneumothorax as a "positive pneumothorax". I am sure the model sometimes predicts pneumothorax just because there is a chest tube."
\end{itemize} 

\begin{figure}[h]
    \centering
    \includegraphics[width=1\textwidth]{img-extra/gif-survey-rel4.png}
    \caption{\small Comparison of the answers given for the two methods on both true positive and false positive examples. The \textbf{x} mark indicates the mean score.}
    \label{fig:surveyresults1}
\end{figure}

\begin{figure}[h]
    \centering
    \includegraphics[width=1\textwidth]{img-extra/gif-reader-results2.png}
    \caption{\small A different view of the study results showing the counts of survey results.}
    \label{fig:surveyresults2}
\end{figure}

% \newpage
% \section{Attribution overlap and label correlation}
% \label{sec:attrib_overlap}

% \begin{figure}[h!]
%     \centering
%     \includegraphics[width=1\textwidth]{img-extra/attribution-corr.png}
%     \caption{\small Studying the relationship of the IoU  between pathology predictions and the correlation between pathology predictions and ground truth labels. It appears samples from the CheXpert and MIMIC-CXR datasets are not as independent as in other datasets. This bias leaves an impression on the trained models that are trained using that data. It appears the bias is visible moreso on the predictions and not the IoU overlap. These plots are computed on images from the PadChest dataset.}
%     \label{fig:my_label}
% \end{figure}

\newpage
\section{Robustness}

\begin{figure}[h!]
    \centering
    \includegraphics[width=1\textwidth]{img-extra/gif-noise.png}
    \includegraphics[width=1\textwidth]{img-extra/gif-blur.png}
    \caption{\small Robustness of the methods when random noise (top) or Gaussian blur (bottom) is applied to the input image at different scales. The evaluation uses 80 images. Images are in the range $[-1024, 1024]$.}
    \label{fig:noise-extra}
\end{figure}

\newpage
\section{Cascading Randomization}
\label{sec:cascading_randomization_appendix}

\begin{figure}[h!]
    \centering
    \begin{tabular}{cc}
        \includegraphics[width=0.49\textwidth]{img-extra/Cardiomegaly_cascrand.png}
             &  
        \includegraphics[width=0.49\textwidth]{img-extra/Edema_cascrand.png}
             \\
        \includegraphics[width=0.49\textwidth]{img-extra/Effusion_cascrand.png}
             & 
        \includegraphics[width=0.49\textwidth]{img-extra/Mass_cascrand.png}
    \end{tabular}
    \caption{\small The cascading randomization sanity check recommended by \citet{Adebayo2018SanityChecks}. The test computes the Spearman rank correlation between the pixel importances of the attribution map for the original network and those obtained as the network is progressively reinitialized. Results for Atelectasis are shown in Figure \ref{fig:cra}; the patterns for the other pathologies are very similar. The value is computed over 40 images from the NIH dataset, and error bars show the standard deviation of the correlation across these images. As the latentshift-max method inherently produces an absolute value map, absolute values are taken of all attribution maps before using this method.}
    \label{fig:cascrand}
\end{figure}
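
The correlation used in this sanity check can be sketched as follows. This is a minimal illustration, assuming attribution maps are given as equally sized 2D lists, that absolute values are taken first as described in the caption, and that tied values receive their average rank. The function names \texttt{average\_ranks} and \texttt{spearman\_abs} and the example maps are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman_abs(map_a, map_b):
    """Spearman rank correlation between the absolute values of two
    attribution maps (Pearson correlation of their ranks).

    Undefined (division by zero) if either map is constant.
    """
    a = [abs(v) for row in map_a for v in row]
    b = [abs(v) for row in map_b for v in row]
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a = sum(ra) / len(ra)
    mean_b = sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical 2x3 attribution maps: one from the trained network and
# one from the same network after some layers were reinitialized.
original = [[0.9, -0.1, 0.3], [0.0, 0.5, -0.7]]
randomized = [[0.2, 0.8, -0.4], [0.6, -0.1, 0.3]]
print(round(spearman_abs(original, randomized), 3))
```

A correlation that stays close to 1 as more layers are reinitialized would suggest that the attribution map depends little on the learned weights, which is the failure mode this check is meant to expose.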

\newpage
\section{Example Explanations}
\label{sec:extraimages}


\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{img-extra/gif-examples.jpg}
    \caption{\small Extra images of the Latent Shift method applied to different pathologies}
    \label{fig:extra1}
\end{figure}

\newpage

\begin{figure}[h]
    \centering
    \includegraphics[width=0.9\textwidth]{img-extra/gif-pneumothorax.jpg}
    \caption{\small Example Pneumothorax predictions of both true and false positives. }
    \label{fig:extra2}
\end{figure}



\newpage
\section{False Positive Examples}
\label{sec:falsepositives}


\begin{figure}[h!]
    \centering
    \includegraphics[width=0.9\textwidth]{img-extra/gif-false-positives.jpg}
    \caption{\small False positive examples used in the reader study. The images are from the NIH dataset. }
    \label{fig:extra}
\end{figure}

% \newpage
% \section{Blank}
% \label{sec:blank}

% This section is left blank so that 

%\stepcounter{section}

\newpage
\section{AE Impact on Classifier Predictions}
\label{sec:aeclfchange}

When the classifier makes a prediction based on the reconstructed image, the prediction is often lower than for the input image. The autoencoder reconstruction seems to remove some of the image features that cause a high prediction.

In order to study the extent of this issue, images for each pathology are evaluated using the classifier and then transformed with the autoencoder (without any $\lambda$ shift) and evaluated using the classifier again. In Table \ref{tab:aepred} the results are shown. Each row corresponds to 500 samples. On average predictions were reduced by 0.12 for images with a positive label and reduced by 0.04 for images with a negative label.


\begin{table}[h]
    \caption{Impact of AE transformation on classification prediction.}
    \label{tab:aepred}
    \centering
    \begin{tabular}{lllccc}
    \toprule
& & &  Mean & Mean    \\   
Label & Model & Target  &         $f(x)$ &        $f(D(E(x)))$ &      diff \\
\midrule
Positive & XRV-all & Atelectasis &  0.45 &  0.32 &  0.13 \\
    &                          & Cardiomegaly &  0.53 &  0.40 &  0.13 \\
    &                          & Effusion &  0.55 &  0.45 &  0.09 \\
    &                          & Lung Opacity &  0.64 &  0.53 &  0.11 \\
    &                          & Mass &  0.61 &  0.49 &  0.12 \\
    &                          & Pneumothorax &  0.49 &  0.43 &  0.06 \\
    & XRV-mimic\_ch & Atelectasis &  0.55 &  0.42 &  0.13 \\
    &                          & Cardiomegaly &  0.60 &  0.57 &  0.03 \\
    &                          & Effusion &  0.60 &  0.48 &  0.12 \\
    &                          & Lung Opacity &  0.59 &  0.39 &  0.20 \\
    &                          & Pneumothorax &  0.54 &  0.45 &  0.09 \\
    & jfhealthcare & Atelectasis &  0.47 &  0.36 &  0.11 \\
    &                          & Cardiomegaly &  0.70 &  0.50 &  0.19 \\
    &                          & Effusion &  0.54 &  0.32 &  0.22 \\
\midrule
Negative & XRV-all & Atelectasis &  0.33 &  0.22 &  0.11 \\
    &                          & Cardiomegaly &  0.19 &  0.13 &  0.07 \\
    &                          & Effusion &  0.22 &  0.19 &  0.03 \\
    &                          & Lung Opacity &  0.29 &  0.28 &  0.01 \\
    &                          & Mass &  0.41 &  0.37 &  0.05 \\
    &                          & Pneumothorax &  0.34 &  0.28 &  0.06 \\
    & XRV-mimic\_ch & Atelectasis &  0.45 &  0.37 &  0.08 \\
    &                          & Cardiomegaly &  0.42 &  0.40 &  0.02 \\
    &                          & Effusion &  0.33 &  0.26 &  0.07 \\
    &                          & Lung Opacity &  0.38 &  0.24 &  0.14 \\
    &                          & Pneumothorax &  0.46 &  0.40 &  0.06 \\
    & jfhealthcare & Atelectasis &  0.30 &  0.34 & -0.04 \\
    &                          & Cardiomegaly &  0.29 &  0.36 & -0.08 \\
    &                          & Effusion &  0.18 &  0.22 & -0.04 \\
    \bottomrule
    \end{tabular}
\end{table}
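
The reported averages can be checked directly from the diff column of Table \ref{tab:aepred}. The short sketch below assumes an unweighted mean over the rows, which matches a per-sample mean because each row corresponds to the same number of samples. The values are copied as printed in the table, so they carry its two-decimal rounding.

```python
# "diff" column of the table, copied as printed (two-decimal rounding).
positive_diffs = [
    0.13, 0.13, 0.09, 0.11, 0.12, 0.06,   # XRV-all
    0.13, 0.03, 0.12, 0.20, 0.09,         # XRV-mimic_ch
    0.11, 0.19, 0.22,                     # jfhealthcare
]
negative_diffs = [
    0.11, 0.07, 0.03, 0.01, 0.05, 0.06,   # XRV-all
    0.08, 0.02, 0.07, 0.14, 0.06,         # XRV-mimic_ch
    -0.04, -0.08, -0.04,                  # jfhealthcare
]


def mean(values):
    """Unweighted arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


print(f"positive: {mean(positive_diffs):.2f}")
print(f"negative: {mean(negative_diffs):.2f}")
```

This yields 0.12 for the positive rows and 0.04 for the negative rows. The negative average is pulled down by the jfhealthcare rows, where the prediction increases after the autoencoder transformation.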

\newpage
\section{Limitations}
\label{sec:limitations}

The autoencoder was developed and trained with the goal of representing specific chest X-ray pathologies. Hyperparameters such as the bottleneck size were specifically tuned to represent the pathologies we studied. We would expect that the resulting autoencoder may not represent other pathologies as well.

The images chosen for the reader study were sampled randomly and may contain multiple different pathologies. Readers were instructed to consider only the specific pathology they were told the model predicted and to ignore others. 

The models are calibrated such that 0.5 is the operating point on the ROC curve, but their predictions are often lower once the input has been transformed by the autoencoder. We study the extent of this issue in Appendix \S \ref{sec:aeclfchange}. We find that, on average, predictions were reduced by 0.12 for images with a positive label and by 0.04 for images with a negative label.



\end{document}