\documentclass{midl} % Include author names
% \documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
% \jmlrvolume{-- Under Review}
\jmlryear{2021}
\jmlrworkshop{Full Paper -- MIDL 2021 submission}
% \editors{Under Review for MIDL 2021}

\usepackage{multirow}
\usepackage{todonotes}\setlength{\marginparwidth}{1.cm}
\definecolor{mycolor}{HTML}{000000}
\newcommand{\rtodo}[1]{\todo[color=red!80]{\scriptsize #1}} % labels reviewer comments
\newcommand\norm[1]{\left\lVert#1\right\rVert} % Norm Symbol

\title[Explainable IQA]{Explainable Image Quality Analysis of Chest X-Rays}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
% \midlauthor{\Name{Caner Özer\nametag{$^{1}$}} \\
% \addr Istanbul Technical University
% \Email{ozerc@itu.edu.tr}\\ \AND
% \Name{İlkay Öksüz{$^{1, 2}$}} \\
% \addr Istanbul Technical University\\
% \addr King's College, London
% \Email{oksuzilkay@itu.edu.tr}
% }

\midlauthor{\Name{Caner {Ö}zer {$^{1}$}} \Email{ozerc@itu.edu.tr}\\
\Name{{İ}lkay {Ö}ksüz {$^{1, 2}$}} \Email{oksuzilkay@itu.edu.tr}\\
\addr {$^{1}$ Department of Computer Engineering, Istanbul Technical University, Turkey} \\
\addr {$^{2}$ School of Biomedical Engineering \& Imaging Sciences, King\textquotesingle s College London, U.K.}}


\begin{document}

\maketitle

\begin{abstract}
Medical image quality assessment is an important aspect of image acquisition, since poor-quality images may lead to misdiagnosis. In addition, manual labelling of image quality after acquisition is often tedious and can produce misleading labels. Despite much research on the automated analysis of image quality, relatively little work has addressed the explanation of these methods. In this work, we propose an explainable image quality assessment system and validate it on the foreign object detection task of the Object-CXR Chest X-Ray dataset. Our explainable pipeline relies on NormGrad, an algorithm that efficiently localizes image quality issues through saliency maps of the classifier. We compare our method with a range of saliency detection methods and illustrate the superior performance of NormGrad, which obtains a Pointing Game accuracy of 0.862 on the test split of the Object-CXR dataset. We also verify our findings through a qualitative analysis by visualizing attention maps for foreign objects on X-Ray images.
\end{abstract}

\begin{keywords}
Saliency detection, Image Quality Analysis, X-Ray, Foreign Object Detection, NormGrad
\end{keywords}


\section{Introduction}
\label{sec:introduction}

% Image Quality Issues
High medical image quality is essential for extracting clinically meaningful information from medical images. Image artefacts degrade the quality of CT \cite{Ma2020}, MRI \cite{Barrett2004}, and X-Ray images \cite{Veldkamp2009} due to various factors. Apart from problems whose effects can be evaluated directly through objective metrics such as Peak Signal-to-Noise Ratio or the Structural Similarity Index, there are additional image quality issues (e.g., foreign objects and breathing artefacts) whose automatic detection requires manual annotation. Such manual labelling is a lengthy procedure and is subject to human error.

% Chest X-ray
Chest X-Ray is one of the most widely used imaging modalities for disease diagnosis, including tuberculosis \cite{Liu2017} and COVID-19 \cite{Oh2020a}. Unfortunately, patients undergoing a Chest X-Ray may carry foreign objects such as buttons or clips. These objects appear on the images \cite{CXR20} and make diagnosis difficult. An automated system that detects foreign objects would therefore accelerate the image acquisition procedure and enable the reacquisition of the affected images on the spot. A naïve way to address this issue is to train a classifier that distinguishes between good- and poor-quality images. However, the classifier may misclassify some images, and interpreting such errors is key to improving the detection accuracy. The lack of explainability of automatic quality assessment techniques hinders the clinical translation of such methods.

% Literature Review
Current state-of-the-art solutions in the computer vision literature resort to visual explanations through either in-model or post-hoc explanation methods \cite{Singh2020}. On the one hand, in-model methods such as attention maps \cite{Schlemper2019} integrate the interpretability method within the model itself. On the other hand, post-hoc explanation methods use a pre-trained model to produce saliency maps that demonstrate what the model has learned during training. In this regard, Grad-CAM \cite{Selvaraju2020} uses gradient-weighted class activation maps to construct a final saliency map. Score-CAM \cite{Wang2020} proposes an alternative scheme for weighting the activation maps. Guided Backpropagation \cite{Springenberg2015} and Input x Gradient \cite{Shrikumar2016} use the gradient directly to construct a saliency map. In addition, a number of methods \cite{Oh2020b, Schut2020} leverage counterfactual analysis to explain a classifier's decision. As a result, such a system can not only make a prediction but also provide the clinician with an explanation, by pointing to the regions that deteriorate the image quality or to the regions that show disease patterns. For example, \cite{Joshi2020} proposed a novel segmentation-guided model explanation framework for recognizing Diabetic Macular Edema from Optical Coherence Tomography imaging. Also, \cite{Costa2017} proposed an end-to-end explainable image quality analysis framework for retinal images.

In this paper, we use NormGrad \cite{Rebuffi20} to generate saliency maps and illustrate the superior performance of this technique in highlighting the evidence behind foreign object detection decisions on Chest X-Rays. NormGrad aggregates spatial contributions using the Frobenius norm and produces a precise saliency map, in contrast to the weighted summation of Grad-CAM. We show that NormGrad provides more accurate saliency maps than other well-known saliency extraction methods on Chest X-Rays, and we verify this claim through qualitative and quantitative analysis. \textcolor{mycolor}{To the best of our knowledge, this work is the first to use NormGrad in medical image quality analysis and the first to perform an explainable image quality analysis on Chest X-Ray data.}




\section{Method}
\label{sec:method}

% In this section, we briefly describe NormGrad framework \cite{Rebuffi20} which is shown for the virtual identity layer choice of convolutional layer with a kernel size of $3$x$3$ in \figureref{fig:normgrad}.
% In this section, we briefly describe Grad-CAM \cite{Selvaraju2020} and NormGrad frameworks \cite{Rebuffi20}, where the latter is shown for the virtual identity layer choice of convolutional layer with a kernel size of $3$x$3$, in \figureref{fig:normgrad}. These methods generally aid us to construct the saliency maps of deep neural network based classifiers.
In this section, we briefly describe the Grad-CAM \cite{Selvaraju2020} and NormGrad \cite{Rebuffi20} frameworks, which are used to construct saliency maps for deep neural network based classifiers.

For both of these frameworks, we assume that there exists a pre-trained neural network in which we would like to extract the knowledge about some target layer $k_t$. We also define its preceding layers with $p$ and succeeding with $q$. Given an input image $\textbf{x} \in {\rm I\!R}^{C \times H \times W}$, where $C$ refers to the number of input channels and $H$ and $W$ correspond to the size of the image, we can also define $\textbf{x}^{in} \in {\rm I\!R}^{K \times H' \times W'}$, $\textbf{x}^{out} \in {\rm I\!R}^{K' \times H' \times W'}$, and the network output, $\textbf{y}$, such that

\begin{equation}
\begin{split}
\textbf{x}^{in} = p(\textbf{x})
 \\
 \textbf{x}^{out} = k_t(\textbf{x}^{in})
 \\
 \textbf{y} = q(\textbf{x}^{out}).
\end{split}
\end{equation}
In order to run Grad-CAM, the gradient of the class score with respect to the output of layer $k_t$, $\textbf{g}^{out} \in {\rm I\!R}^{K' \times H' \times W'}$, must be collected alongside the activations of the same layer, $\textbf{x}^{out}$. Unless stated otherwise, the gradient is computed by assuming that $\textbf{y}$ is the ground-truth class label, although this may not hold for all samples.

\textcolor{mycolor}{In addition, consider a virtual identity layer, $\Tilde{k}_t$, placed right after the layer $k_t$, whose output $\Tilde{\textbf{x}}^{out}$ satisfies $\Tilde{\textbf{x}}^{out} = \textbf{x}^{out}$. The purpose of adding this layer is to ensure that activations and gradients are collected from the same position in the network as in other saliency methods. This layer can be a bias, scaling or convolutional layer. When the chosen framework is NormGrad, we collect the output activations, $\Tilde{\textbf{x}}^{out}$, and the corresponding upstream gradient, $\textbf{g}^{out}$, of the layer $\Tilde{k}_t$.}
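
As a concrete illustration of these shapes, consider a $600 \times 600 \times 3$ input and the last convolutional block of a ResNet-34. The layer layout used here is an assumption made for illustration (the standard ResNet-34 design: a stride-2 $7 \times 7$ convolution with padding $3$, a stride-2 $3 \times 3$ max-pooling with padding $1$, one stride-1 stage, and three stride-2 stages built on $3 \times 3$ convolutions with padding $1$, ending in $512$ channels). Applying $n \mapsto \lfloor (n + 2p - k)/s \rfloor + 1$ at every strided operation, the spatial size would evolve as
\[
600 \;\rightarrow\; 300 \;\rightarrow\; 150 \;\rightarrow\; 75 \;\rightarrow\; 38 \;\rightarrow\; 19 ,
\]
so that $K' = 512$ and $H' = W' = 19$. Under this assumption, $\textbf{x}^{out}$ and $\textbf{g}^{out}$ each hold $512 \times 19 \times 19 = 184{,}832$ values.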


% In addition, assume that a virtual identical layer, $\Tilde{k}_t$, is present right after the layer $k_t$, which satisfies the property $\textbf{x}^{out} = \Tilde{k}_t(\textbf{x}^{out})$. This layer can be any type among bias, scaling or convolutional layers. We accumulate the output activations, $\textbf{x}^{out}$, and the corresponding upstream gradient, $\textbf{g}^{out}$, of the layer $\Tilde{k}_t$.

\subsection{Grad-CAM}
\textcolor{mycolor}{
In general, Grad-CAM constructs the saliency maps in four steps. First, Grad-CAM creates the importance weight vector, $\alpha \in {\rm I\!R}^{K'}$, by}

\begin{equation}
    \textcolor{mycolor}{\alpha_k = \frac{1}{H' \times W'} \sum_{h, w} \textbf{g}^{out}_{k, h, w}}
\end{equation}
\textcolor{mycolor}{where $H' \times W'$ acts as an averaging term across all spatial positions of the feature map. This weight vector provides the importance of the filters in $k_t$. After that, two operations take place between the vector $\alpha$ and $\textbf{x}^{out}$, as in \equationref{eq:gc_spat_contrib}}

\begin{equation}
    \label{eq:gc_spat_contrib}
    \textcolor{mycolor}{S = \sum_k \alpha_{k} \odot \textbf{x}_{k}^{out}}
\end{equation}

\textcolor{mycolor}{where each scalar component of $\alpha$ is multiplied with the corresponding matrix component of $\textbf{x}^{out}$, and the products are summed to obtain an aggregated spatial contribution, $S$. As a consequence, $S$ may contain inhibited regions with values below $0$. Grad-CAM therefore suppresses these regions by applying the ReLU activation function to the aggregated spatial contribution, which yields the final heat-map $\textbf{m}$ with a shape of $[H' \times W']$, as in \equationref{eq:gc_relu}.}

\begin{equation}
    \label{eq:gc_relu}
    \textcolor{mycolor}{\textbf{m} = \mathrm{ReLU}(S)}
\end{equation}

\textcolor{mycolor}{Finally, Grad-CAM up-samples $\textbf{m}$ to the original input image size which is $[H \times W]$.}
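
As a small numerical illustration of these steps (toy values chosen for exposition, not taken from our experiments), let $K' = 2$ and $H' = W' = 2$ with
\[
\textbf{g}^{out}_1 = \begin{bmatrix} 1 & 3 \\ 0 & 0 \end{bmatrix}, \quad
\textbf{g}^{out}_2 = \begin{bmatrix} -2 & -2 \\ -2 & -2 \end{bmatrix}, \quad
\textbf{x}^{out}_1 = \begin{bmatrix} 2 & 0 \\ 1 & 4 \end{bmatrix}, \quad
\textbf{x}^{out}_2 = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.
\]
Averaging each gradient map over its four spatial positions gives $\alpha_1 = (1 + 3 + 0 + 0)/4 = 1$ and $\alpha_2 = -2$, so that
\[
S = 1 \cdot \textbf{x}^{out}_1 + (-2) \cdot \textbf{x}^{out}_2 = \begin{bmatrix} 0 & -2 \\ 1 & 2 \end{bmatrix}, \qquad
\textbf{m} = \mathrm{ReLU}(S) = \begin{bmatrix} 0 & 0 \\ 1 & 2 \end{bmatrix}.
\]
The negative entry of $S$ is clipped to zero, and the resulting $2 \times 2$ map would then be up-sampled to the input resolution.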
% In general, Grad-CAM constructs the saliency maps in four steps. First, Grad-CAM creates the importance weight vector $\alpha$ by summing the last two dimensions of $\textbf{g}_{out}$ and normalizing it to the total number of pixels. This vector provides the importance of the filters in $k_t$. After that, an element-wise multiplication operation takes place between $\alpha$ vector and $\textbf{x}_{out}$ tensor where each of the $\alpha^i$ is multiplied with $\textbf{x}_{out}^i$ while $i$ referring to the filter index of $k_t$. We denote this output  $S$ such that each of $S_i$ corresponds to the spatial contribution of the $i^{th}$ filter of $k_t$. The way that Grad-CAM aggregates these spatial contributions is by summing them all. However, as a consequence, there may exist some inhibited regions with some value less than $0$. In this regard, Grad-CAM applies saturation by using ReLU activation function on the aggregated spatial contribution to obtain the final heat-map $\textbf{m}$ which has a shape of $[H' \times W']$. We up-sample this heat map to the original input image size, $[H \times W]$.

\subsection{NormGrad}

\begin{figure*}[h!]
    \centering
    \includegraphics[scale=0.60]{images/normgrad.pdf}
    % \vspace{-1.0cm}
    \caption{NormGrad Framework. The top figure shows the flow of a neural network where a virtual identity layer is placed right after the last convolutional layer to accumulate the activations and gradients. The bottom figure demonstrates how they are used to obtain a unified heat map by using NormGrad. Red and blue colors within the heat maps point to regions with high and low values. \textcolor{mycolor}{CNN, GAP and FC stand for Convolutional Neural Networks, Global Average Pooling and Fully-Connected, respectively.}}
    % \vspace{-0.40cm}
    \label{fig:normgrad}
\end{figure*}

\textcolor{mycolor}{In contrast to Grad-CAM, NormGrad exploits $\Tilde{\textbf{x}}^{out}$ and $\textbf{g}^{out}$ directly without estimating a weight vector from the gradients or activations. Moreover, it uses in-place virtual identity layers to represent different ways of obtaining spatial contributions. In \tableref{tab:grads}, we provide the analytic derivations of the spatial contributions and their corresponding matrix/tensor sizes, depending on the virtual identity layer of choice. The first row corresponds to a virtual bias layer, whose spatial contribution is equal to the upstream gradient, $\textbf{g}^{out}$, and the second row corresponds to a virtual scaling layer, whose spatial contribution is the element-wise product of activations and gradients. However, additional operations are needed to calculate the spatial contribution of an $N \times N$ convolutional layer. Suppose that we express the convolution operation using matrix multiplication as}
\begin{equation}
    \textcolor{mycolor}{\Tilde{\textbf{X}}^{out} = \Tilde{\textbf{W}} \textbf{X}^{out}_{N \times N},}
\end{equation}
\textcolor{mycolor}{where $\Tilde{\textbf{W}} \in {\rm I\!R}^{K' \times N^2K'}$ denotes the parameters of the layer $\Tilde{k}_t$, $\textbf{X}^{out}_{N \times N} \in {\rm I\!R}^{N^2K' \times H'W'}$ is the unfolded version of $\textbf{x}^{out}$, and $\Tilde{\textbf{X}}^{out} \in {\rm I\!R}^{K' \times H'W'}$ is the output of the convolution operation. The unfold operation extracts $N \times N$ patches from $\textbf{x}^{out}$ and is mainly used to speed up the gradient calculation. Each column of $\textbf{X}^{out}_{N \times N}$, indexed by a spatial location $u$ in the set of locations $\Omega$, is denoted by $\textbf{x}^{out}_{u, N \times N} \in {\rm I\!R}^{N^2K'}$ and appears in the gradient of the loss $L$ w.r.t. $\Tilde{\textbf{W}}$ as}
\begin{equation}
    \textcolor{mycolor}{\frac{dL}{d\Tilde{\textbf{W}}} = \sum_{u \in \Omega} \frac{d}{d\Tilde{\textbf{W}}} \langle \textbf{g}^{out}_u, \Tilde{\textbf{W}} \textbf{x}^{out}_{u, N \times N} \rangle = \sum_{u \in \Omega} \textbf{g}_u^{out} {\textbf{x}^{out}_{u, N \times N}}^\intercal.}
\end{equation}
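
To give a sense of scale, assume for illustration only that $N = 3$, $K' = 512$ and $H' = W' = 19$, which is what a standard ResNet-34 layout would give at its last block for a $600 \times 600$ input. Each unfolded column then has $N^2K' = 9 \times 512 = 4608$ entries, each per-location contribution $\textbf{g}_u^{out} {\textbf{x}^{out}_{u, N \times N}}^\intercal$ is a $512 \times 4608$ matrix, and there are $H'W' = 361$ locations:
\[
512 \times 4608 = 2{,}359{,}296, \qquad 361 \times 2{,}359{,}296 = 851{,}705{,}856 .
\]
Forming all of these matrices explicitly would therefore be costly, which is one practical reason to summarise each of them by a single norm value, as in the last column of \tableref{tab:grads}.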

% \begin{equation}
% \begin{split}
% \textbf{x}^{in} = p(\textbf{x})
%  \\
%  \textbf{x}^{out} = k_t(\textbf{x}^{in})
%  \\
%  \textbf{y} = q(\textbf{x}^{out}).
% \end{split}
% \end{equation}

% After gathering the activations and gradients of $\Tilde{k_t}$, we determine the spatial contributions depending on the virtual identity layer type. In \tableref{tab:grads}, we provide the analytic derivations of the spatial contributions and their corresponding matrix/tensor sizes. We used the notation of \cite{Laue2020} which simply explains generic tensor multiplication in the form of $C = A *_{(s_1, s_2, s_3)} B$ that $s_1$, $s_2$ and $s_3$ are the index set order of the tensors $A$, $B$ and $C$, respectively. In this regard, the spatial contribution is $\Tilde{\textbf{g}}^{out}$ for a bias virtual identity layer, while it is the element-wise product of $\Tilde{\textbf{g}}^{out}$ and $\Tilde{\textbf{x}}^{out}$ for a scaling layer. However, when we calculate the spatial contribution of a convolutional layer as with a kernel size $N x N$, we first need to unfold $\Tilde{\textbf{x}}^{out}$ as in \figureref{fig:normgrad}. We also need to vectorize the last two dimensions of $\Tilde{\textbf{x}}^{out}$ and $\Tilde{\textbf{g}}^{out}$ tensors for simplicity. The former will have a shape of $[N^2K' \times H'W']$ whereas the latter have $[K' \times H'W']$. Now that we have two matrices, we can take the outer product of the column vectors as provided in the last row of \tableref{tab:grads}.

\textcolor{mycolor}{The next step is to aggregate these spatial contributions in order to obtain a unified heat-map, $\textbf{m}$, which has a shape of $[H' \times W']$. NormGrad uses the $L^2$/Frobenius norm as the aggregation function, which handles the fact that each spatial contribution of a convolutional layer is a matrix rather than a vector. In the last column of \tableref{tab:grads}, we show the analytical formulas for the saliency maps generated by using the $L^2$ norm for bias and scaling layers, and the Frobenius norm for Conv $N \times N$ layers. Following this aggregation procedure, NormGrad up-samples $\textbf{m}$ to the original image size, $[H \times W]$, which yields the final output for the single-layer scenario of NormGrad.}
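
The product form in the last row of \tableref{tab:grads} follows from a standard identity for the Frobenius norm of an outer product. Writing $a = \textbf{g}^{out}_u$ and $b = \textbf{x}^{out}_{u, N \times N}$ (a shorthand introduced only for this derivation),
\[
\lVert a b^\intercal \rVert_F^2 = \sum_{i} \sum_{j} (a_i b_j)^2 = \Big( \sum_{i} a_i^2 \Big) \Big( \sum_{j} b_j^2 \Big) = \lVert a \rVert^2 \, \lVert b \rVert^2 .
\]
For instance, with the toy vectors $a = (3, 4)^\intercal$ and $b = (1, 2, 2)^\intercal$,
\[
a b^\intercal = \begin{bmatrix} 3 & 6 & 6 \\ 4 & 8 & 8 \end{bmatrix}, \qquad
\lVert a b^\intercal \rVert_F = \sqrt{9 + 36 + 36 + 16 + 64 + 64} = \sqrt{225} = 15 = 5 \times 3 = \lVert a \rVert \, \lVert b \rVert .
\]
The saliency value at each location can thus be obtained from two vector norms without forming the outer product.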

\textcolor{mycolor}{In addition, it is also possible to add more than one virtual identity layer inside the network and to combine the saliency maps generated from the gradients and activations of these layers. In this paper, we select the uniform combination setting, which calculates the geometric mean of the given heat-maps. Additionally, we use the same type of virtual identity layer throughout, which ensures a fair assessment of performance. Given $J$ heat-maps prior to aggregation, we obtain the combined heat-map, $\textbf{M}$, by}
\begin{equation}
    \textcolor{mycolor}{\textbf{M} = \prod_{j=1}^{J} \sqrt[J]{\textbf{m}_j}.}
\end{equation}
\textcolor{mycolor}{\textbf{M} will be the final output for the combined layer scenario of NormGrad.}
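
To illustrate the behaviour of this combination rule with toy values (not measured ones), consider $J = 2$ heat-maps evaluated at a single pixel, after both have been brought to a common resolution (an assumption of this example). If $\textbf{m}_1 = 0.9$ and $\textbf{m}_2 = 0.4$ at that pixel, then
\[
\textbf{M} = \sqrt{0.9 \times 0.4} = \sqrt{0.36} = 0.6 ,
\]
whereas $\textbf{m}_2 = 0$ gives $\textbf{M} = 0$ regardless of $\textbf{m}_1$. A pixel therefore remains active only if every combined layer assigns it a non-zero value, so a single layer can suppress a region entirely.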

% The next step is to aggregate the spatial contributions after reshaping them to $[N^2K'^2 \times H' \times W']$. We use Frobenius Norm in order to obtain a unified heat-map, $\textbf{m}$, which has a shape of $[H' \times W']$. Finally, we up-sample $\textbf{m}$ to the original image size, $[H \times W]$ and obtain our final output for the single layer scenario of NormGrad. In addition, it is also possible to add more than a single virtual identity layer inside the network. In this paper, we selected the uniform combination setting for heat-map combination. Given $J$ heat-maps prior to aggregation, we can obtain the aggregated heat-map, $\textbf{M}$, by $\Pi_{j=1}^J \sqrt[J]{\textbf{m}_j}$. $\textbf{M}$ will be the final output for the combined layer scenario of NormGrad.

\begin{table}[]
  \caption{Spatial contributions, shapes and \textcolor{mycolor}{saliency map formulas} of different virtual identity layer choices.}
    \centering
        \begin{tabular}{|r|c|c|c|}
        \hline
        \textbf{Layer}        & \textbf{Spatial Contribution}                               & \textbf{Shape}        & \textcolor{mycolor}{\textbf{Saliency Map}}            \\ \hline
        Bias         & $\textbf{g}^{out}_u$                         & $K'$   & $\norm{\textbf{g}^{out}_u}$                   \\ \hline
        Scaling      & $\textbf{g}^{out}_u \odot \textbf{x}^{out}_u$  & $K'$ & $\norm{\textbf{g}^{out}_u \odot \textbf{x}^{out}_u}$   \\ \hline
        Conv $N \times N$ & $\textbf{g}^{out}_u  {\textbf{x}^{out}_{u, N \times N}}^\intercal$      & $K' \times N^2K'$ & $\norm{\textbf{g}^{out}_u} \norm{\textbf{x}^{out}_{u, N \times N}}$\\ \hline
        \end{tabular}
    \label{tab:grads}
\end{table}


\section{Experimental Results}
\label{sec:experiments}

\begin{figure*}[h!]
    \centering
        \begin{minipage}[b]{0.24\linewidth}
          \centering
          \centerline{\includegraphics[width=3.5cm]{images/09015.jpg}}
          \medskip
        \end{minipage}
        \hfill
        \begin{minipage}[b]{.24\linewidth}
          \centering
          \centerline{\includegraphics[width=3.5cm]{images/09015_ng_conv1x1.png}}
          \medskip
        \end{minipage}
        \hfill
        \begin{minipage}[b]{.24\linewidth}
          \centering
          \centerline{\includegraphics[width=3.5cm]{images/09015_gradcam.png}}
          \medskip
        \end{minipage}
        \hfill
        \begin{minipage}[b]{0.24\linewidth}
          \centering
          \centerline{\includegraphics[width=3.5cm]{images/09015_ng_comb_conv1x1.png}}
          \medskip
        \end{minipage}
        \hfill
    % \vspace{-0.50cm}
    \caption{Attention maps of an image containing multiple foreign objects to be detected. From left to right: original image, NormGrad conv1x1 single, Grad-CAM, NormGrad conv1x1 combined. Foreign objects outside the lungs are not annotated with a bounding box in this dataset. (Best viewed zoomed in and in color.)}
    % \vspace{-0.20cm}

    \label{fig:results}
\end{figure*}

We perform our experiments on the Object-CXR dataset \cite{CXR20}, a benchmark dataset for foreign object recognition and localization on Chest X-Ray images. The dataset consists of 10,000 Chest X-Ray images in total, of which 5,000 include foreign objects and 5,000 do not. We make the code and the experiments available at \url{https://github.com/canerozer/explainable-iqa}.

% \todo{IO: Resized to what?? CO: Resolved}
%\todo{Pre-trained on Imagnet or not}
%\todo{Are these selection optimized? "chosen to be" is not a phrase we can use in the paper}
\subsection{Image Quality Classification with ResNet-34 model} 

Our model for this benchmark is a ResNet-34, which takes a $600 \times 600 \times 3$ input image and predicts whether the image contains at least one foreign object. We fine-tune the ResNet-34 model, pre-trained on the ImageNet dataset \cite{Deng2009}, for $20$ epochs using a batch size of $16$ and the cross-entropy loss function. To this end, we replicate the input image $3$ times along the channel axis, replace the last layer of the ResNet-34 model with a layer that has $2$ output neurons, and inherit the remaining layers from the pre-trained model.
The optimizer is stochastic gradient descent with momentum, with an initial learning rate of 0.005.
We also divide the learning rate by 10 every 5 epochs. Lastly, we use color jittering, affine transformations and horizontal flips as data augmentations during training. With this model, we achieve an AUC score of 0.937 on the testing split of the Object-CXR dataset.
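
Under these settings, and assuming zero-based epoch indices with the first reduction applied once five epochs have been completed (the exact epoch at which each step takes effect is an assumption made here for illustration), the learning rate at epoch $e$ can be written as
\[
\eta_e = 0.005 \times 10^{-\lfloor e/5 \rfloor}, \qquad e = 0, 1, \ldots, 19,
\]
which gives $5 \times 10^{-3}$ for epochs $0$ to $4$, $5 \times 10^{-4}$ for epochs $5$ to $9$, $5 \times 10^{-5}$ for epochs $10$ to $14$, and $5 \times 10^{-6}$ for epochs $15$ to $19$.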

\subsection{Experimental Setup for attention maps}

We present our qualitative and quantitative analysis on the validation and testing splits of the Object-CXR dataset. Each split contains 1,000 images, of which 500 contain foreign objects and 500 do not. We first examine how the attention maps change when we modify the virtual identity layer type of the NormGrad framework. We provide results not only for a single heat-map but also for the combination of 4 different heat-maps. Then, we compare the best performing NormGrad settings with baselines such as Grad-CAM \cite{Selvaraju2020}, Guided Grad-CAM \cite{Selvaraju2020}, Guided Backpropagation \cite{Springenberg2015} and Input x Gradient \cite{Shrikumar2016}. In order to provide a fair comparison among methods, we use the last convolutional block of ResNet-34, namely layer4.2, for both Grad-CAM and the single-layer setting of NormGrad. In addition, we examine the combined-layer setting of NormGrad, which involves layer2.0, layer3.0, layer4.0, and layer4.2 of ResNet-34. Qualitative results of NormGrad are obtained by using the virtual conv $1 \times 1$ layer.

\subsection{Qualitative Results} 

\figureref{fig:results} shows an example with multiple foreign objects to compare the outputs of NormGrad and Grad-CAM. We notice that Grad-CAM provides attention maps on the target objects of interest with some offset. We refer to them as \textit{skewed} attention maps, which are characterized by circular sectors extending outward from the target objects. Furthermore, some salient regions do not intersect with any target object, and for this reason they are considered false positives. In stark contrast, the single-layer setting of NormGrad shows a significant reduction in the \textit{skewness} of the attention maps on the target objects and fewer false positives, which suggests that it focuses more accurately than Grad-CAM on the target foreign objects.
Besides, when we use the combined setting of NormGrad, we notice a sharp fall in the area of activity, since earlier layers inhibit the heat-map. However, this may also completely suppress the heat-map activity on some target objects, such as the one at the top right. Nevertheless, the combined heat-map is promising for a potential weakly-supervised semantic segmentation of the foreign objects, because NormGrad provides more reliable and precise attention maps than Grad-CAM.

In \figureref{fig:results_misclassification} in Appendix \ref{apx:last}, we present a representative example from the Object-CXR dataset in which the model misclassifies the image. Although our ResNet-34 model does not predict any foreign object in this image, the ground-truth label indicates that one is present. This misclassification error is visible in the attention map of Grad-CAM, which points to all regions except for the region enclosed by the bounding box. Hence, Grad-CAM is not robust enough to handle misclassification errors. In stark contrast, NormGrad overcomes this problem by attending only to the foreign object.

\subsection{Quantitative Results} 

We perform a quantitative analysis with the Pointing Game \cite{Zhang16}, which measures whether saliency maps align with the ground-truth bounding boxes. In the Pointing Game, a saliency map is considered accurate if the location of its maximum value lies within a pixel offset, $\tau$, of any of the bounding box annotations. Denoting the number of accurate saliency maps by $T$ and the number of inaccurate ones by $F$, the accuracy metric $A$ is defined as $A = \frac{T}{T + F}$. In \tableref{tab:pointing_game}, we report the quantitative results for the compared methods and the ablation study of the proposed NormGrad setup for $\tau=25$. We see that almost all of the NormGrad settings outperform the baseline methods in both the validation and testing splits of Object-CXR, with maximum accuracies of 0.880 and 0.862, respectively. The advantage of NormGrad can be attributed to its spatial contribution aggregation scheme, the Frobenius norm, especially when compared with Grad-CAM. In addition, there is only a slight difference among the available settings of NormGrad, except for the bias layer. \textcolor{mycolor}{When the virtual bias layer is placed at the end of the network, it is unable to exploit activations while generating the heat-maps. Hence, it generates a constant spatial contribution across the heat-map since the upstream gradient is a single scalar for the last layer. However, this problem can be handled by placing additional virtual bias layers within the intermediate layers of the network. By combining 4 heat-maps using these virtual bias layers, we increase the accuracy from $0.120$ to $0.776$ on the validation split and from $0.112$ to $0.766$ on the testing split of the Object-CXR dataset.}
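
As a worked instance of this metric, assume that it is evaluated over the $500$ images of a split that contain annotated foreign objects. The denominator is not stated explicitly above, so this is an assumption; it is consistent with all reported accuracies being multiples of $0.002$. Under this assumption, the best test accuracy would correspond to
\[
A = \frac{T}{T + F} = \frac{431}{431 + 69} = 0.862 ,
\]
while the Grad-CAM test accuracy of $0.656$ would correspond to $T = 328$ and $F = 172$.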

\begin{table}[h!]
\centering
\caption{Pointing Game results on the validation and testing splits of Object-CXR for NormGrad and the baseline methods.}
\begin{tabular}{|l|c|c|c|c|}
\hline
\textbf{Method}                                                           & \textbf{Layer Type}                & \textbf{Heat-map}              & \textbf{Val}                        & \textbf{Test}                       \\ \hline
\multirow{8}{*}{NormGrad}                                        & \multirow{2}{*}{Scaling}  & Single       & 0.878                      & 0.854                      \\ \cline{3-5} 
                                                                 &                           & Combined     & 0.878                      & 0.846                      \\ \cline{2-5} 
                                                                 & \multirow{2}{*}{Conv 1x1} & Single       & 0.876                      & 0.856                      \\ \cline{3-5} 
                                                                 &                           & Combined     & \textbf{0.880}                      & 0.846                      \\ \cline{2-5} 
                                                                 & \multirow{2}{*}{Conv 3x3} & Single       & 0.874                      & \textbf{0.862}                      \\ \cline{3-5} 
                                                                 &                           & Combined     & \textbf{0.880}                      & 0.850                      \\ \cline{2-5} 
                                                                 & \multirow{2}{*}{Bias}     & Single       & 0.120                      & 0.112                      \\ \cline{3-5} 
                                                                 &                           & Combined     & 0.776                      & 0.766                      \\ \hline
InputxGrad                                                 & \multicolumn{1}{c|}{-}     & \multicolumn{1}{c|}{-} & \multicolumn{1}{l|}{0.246} & \multicolumn{1}{l|}{0.240} \\ \hline
\begin{tabular}[c]{@{}l@{}}Guided BP\end{tabular} & \multicolumn{1}{c|}{-}     & \multicolumn{1}{c|}{-} & \multicolumn{1}{l|}{0.208} & \multicolumn{1}{l|}{0.188} \\ \hline
\begin{tabular}[c]{@{}l@{}}Guided\\ Grad-CAM\end{tabular}        & \multicolumn{1}{c|}{-}     & \multicolumn{1}{c|}{-} & \multicolumn{1}{l|}{0.348} & \multicolumn{1}{l|}{0.334} \\ \hline
Grad-CAM                                                         & \multicolumn{1}{c|}{-}     & \multicolumn{1}{c|}{-} & \multicolumn{1}{l|}{0.684} & \multicolumn{1}{l|}{0.656} \\ \hline
\end{tabular}

% \vspace{-0.65cm}
\label{tab:pointing_game}
\end{table}

\section{Discussion and Conclusions}
In this work, we proposed an automated framework for medical image quality analysis and used NormGrad to explain its decisions. We showed empirically that, compared with the other saliency methods, NormGrad produces more accurate saliency maps that are centered on the target object of interest. Furthermore, using NormGrad instead of Grad-CAM reduces the number of false positives. We also observed that NormGrad's saliency maps are more robust to model-based misclassification errors than those of Grad-CAM. We additionally validated these findings quantitatively through the Pointing Game benchmark on the Object-CXR dataset, where NormGrad outperformed the other saliency detection methods.

NormGrad achieves substantial performance improvements in our qualitative and quantitative assessments. Interestingly, our findings contradict the claims of \cite{Wang2020b}, which uses the same accuracy metric as this work. However, the performance of NormGrad also depends on its configuration, namely the choice of virtual identity layer and whether and how the heat-map combination scheme is applied. Overall, NormGrad is a promising method for neural network interpretability, \textcolor{mycolor}{specifically for Chest X-Ray image quality analysis.}

As future work, we are interested in assessing the effect of combining the heat-maps of different layers and in investigating re-weighting options for heat-map combination. In addition, we would like to perform further experiments on different imaging modalities, e.g., CT and MR. Lastly, we would like to enhance the NormGrad framework by providing medical insight.

\midlacknowledgments{We thank Mehmet Ozan Unal for his contribution to the development of the pre-trained model. This paper has been produced benefiting from the 2232 International Fellowship for Outstanding Researchers Program of TUBITAK (Project No: 118C353). However, the entire responsibility of the publication/paper belongs to the owner of the paper. The financial support received from TUBITAK does not mean that the content of the publication is approved in a scientific sense by TUBITAK.}

\bibliography{ozer21}


\newpage

\appendix

\section{Qualitative Results}

\begin{figure}[h!]
        \begin{minipage}{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm, height=4.75cm]{images/09353.jpg}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09353_normgrad.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09353_gradcam.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm, height=4.75cm]{images/09357.jpg}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09357_normgrad.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09357_gradcam.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm, height=4.75cm]{images/09361.jpg}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09361_normgrad.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09361_gradcam.png}}
          \medskip
        \end{minipage}

    \caption{Additional examples of foreign object localization with NormGrad and Grad-CAM. From left to right: Original Image, NormGrad conv3x3 single, Grad-CAM.}
    \label{fig:visualresults_appendix}
\end{figure}
\newpage

\section{NormGrad Failures}

\begin{figure}[h!]
        \begin{minipage}{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm, height=4.75cm]{images/09070.jpg}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09070_ng.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09070_gc.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm, height=4.75cm]{images/09145.jpg}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09145_ng.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09145_gc.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm, height=4.75cm]{images/09529.jpg}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09529_ng.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09529_gc.png}}
          \medskip
        \end{minipage}

    \caption{Some sample images where NormGrad fails to detect all the foreign objects. From left to right: Original Image, NormGrad conv3x3 single, Grad-CAM.}
    \label{fig:failures_appendix}
\end{figure}

\newpage
\section{False Positive Comparison}

\begin{figure}[h!]
        \begin{minipage}{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm, height=4.75cm]{images/09051.jpg}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09051_ng.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09051_gc.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm, height=4.75cm]{images/09055.jpg}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09055_ng.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09055_gc.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm, height=4.75cm]{images/09060.jpg}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09060_ng.png}}
          \medskip
        \end{minipage}
        \begin{minipage}{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09060_gc.png}}
          \medskip
        \end{minipage}

    \caption{Some sample images where NormGrad produces fewer false positive regions than Grad-CAM. From left to right: Original Image, NormGrad conv3x3 single, Grad-CAM.}
    \label{fig:results_appendix}
\end{figure}

\newpage
\section{Extracting Misclassifications}
\label{apx:last}
\begin{figure}[h!]
        \begin{minipage}[b]{0.32\columnwidth}
          \raggedleft
          \centerline{\includegraphics[width=4.75cm]{images/09010_resized_cropped.png}}
        %  \vspace{1.5cm}
          \medskip
        \end{minipage}
        \begin{minipage}[b]{0.33\columnwidth}
          \centering
          \centerline{\includegraphics[width=4.75cm]{images/09010_ng_conv3x3_cropped.png}}
        %  \vspace{1.5cm}
          \medskip
        \end{minipage}
        \begin{minipage}[b]{0.33\columnwidth}
          \raggedright
          \centerline{\includegraphics[width=4.75cm]{images/09010_gradcam_cropped.png}}
        %  \vspace{1.5cm}
          \medskip
        \end{minipage}
    % \vspace{-2.25cm}
    \caption{Attention maps across different saliency and attribution methods when the image is misclassified. From left to right: Original Image, NormGrad conv3x3 single, Grad-CAM. The image was cropped after the saliency maps were obtained.}
    \label{fig:results_misclassification}
    % \vspace{-0.20cm}
\end{figure}

\end{document}
