\documentclass{midl} % Include author names
% \documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{multirow}
\usepackage{mwe} % to get dummy images
% \jmlrvolume{-- Under Review}
\jmlryear{2020}
\jmlrworkshop{Full Paper -- MIDL 2020}
% \editors{Under Review for MIDL 2020}

\title[KD-MRI: Knowledge distillation for MRI]{KD-MRI: A knowledge distillation framework for image reconstruction and image restoration in MRI workflow}

\midlauthor{\Name{Balamurali Murugesan  \nametag{$^{1,2}$}} \Email{balamurali@htic.iitm.ac.in}\\ 
\addr $^{1}$ Indian Institute of Technology Madras (IITM), India \\
\addr $^{2}$ Healthcare Technology Innovation Centre (HTIC), IITM, India \AND
\Name{Sricharan Vijayarangan \nametag{$^{2}$}}\midljointauthortext{Equal contribution.}  \Email{sricharanv@htic.iitm.ac.in} \\
\Name{Kaushik Sarveswaran \nametag{$^{2}$}}\midlotherjointauthor \midljointauthortext{Work done while interning at HTIC.} \Email{kaushik3497@yahoo.co.in} \\ 
\Name{Keerthi Ram \nametag{$^{2}$}} \Email{keerthi@htic.iitm.ac.in} \\
\Name{Mohanasankar Sivaprakasam \nametag{$^{1,2}$}} \Email{mohan@ee.iitm.ac.in} \\
}


\begin{document}
\maketitle

\begin{abstract}
Deep learning networks are being developed for every stage of the MRI workflow and have provided state-of-the-art results. However, this has come at the cost of increased computation and storage requirements. Hence, replacing the networks with compact models at various stages in the MRI workflow can significantly reduce the required storage space and provide considerable speedup. In computer vision, knowledge distillation is a commonly used method for model compression. In our work, we propose a knowledge distillation (KD) framework for the image-to-image problems in the MRI workflow in order to develop compact, low-parameter models without a significant drop in performance. We propose a combination of an attention-based feature distillation method and imitation loss, and demonstrate its effectiveness on the popular MRI reconstruction architecture DC-CNN. We conduct extensive experiments using Cardiac, Brain, and Knee MRI datasets for 4x, 5x and 8x accelerations. We observed that the student network trained with the assistance of the teacher using our proposed KD framework provided significant improvement over the student network trained without assistance across all the datasets and acceleration factors. Specifically, for the Knee dataset, the student network achieves $65\%$ parameter reduction, 2x faster CPU running time, and 1.5x faster GPU running time compared to the teacher. Furthermore, we compare our attention-based feature distillation method with other feature distillation methods and conduct an ablation study to understand the significance of attention-based distillation and imitation loss. Finally, we extend our KD framework to MRI super-resolution and show encouraging results.

\end{abstract}

\begin{keywords}
MRI workflow, Model compression, Knowledge distillation, MRI reconstruction, MRI super-resolution.
\end{keywords}


\section{Introduction}
The Magnetic Resonance Imaging (MRI) workflow consists of image acquisition, reconstruction, restoration, registration and analysis \cite{mri_survey}. In every stage of the MRI pipeline, deep learning networks have shown encouraging results and are being integrated into the medical workflow \cite{deeplearning_medical}. This integration demands more storage and compute power, as the improved performance of deep networks comes at the cost of computation and storage. Consequently, hospitals, which are already burdened with storing large medical records, will now have to allocate additional storage for the deep learning models. Furthermore, with the advent of patient-specific care \cite{patient_specific} and federated learning \cite{federated_learning}, the need for storage and compute power will continue to increase.


Deep networks are task specific: separate networks are required for image segmentation, image reconstruction, image super-resolution, object detection, etc. Consequently, individual networks are developed for the different tasks in each stage of the MRI workflow. In addition to being task specific, deep networks are also dataset specific. A network developed for a particular task using a certain dataset might perform poorly on a new dataset from a different distribution. In MRI, a dataset is determined by the anatomical study (brain, cardiac, knee) and its respective contrast (T1, T2). So, for every task in the MRI workflow, specific deep networks have to be developed for each dataset.
%Research works on domain adaptation have tried to make deep networks work across datasets \cite{domain_adaptation1}. %However, this comes at significant drop in performance. %Despite all the advances in combining networks across tasks and datasets, the number of networks in the MRI pipeline remain immeasurably high. 
Furthermore, for tasks like reconstruction, apart from the choice of dataset, the degradation of the input image varies with the acceleration factor (2x, 4x, 8x) and the undersampling mask (Cartesian, Gaussian), causing a further change in distribution. Because a separate network is needed for each configuration (task, dataset, type of degradation), the memory and computation cost of deploying existing state-of-the-art deep networks at each stage of an MRI workflow grows multiplicatively with the number of options. Hence, there is a pressing need for memory-efficient model development.
%with model compression being a promising direction. 

Model compression has been an actively pursued area of research over the last few years, with the goal of deploying state-of-the-art deep networks on low-power and resource-limited devices without a significant drop in accuracy \cite{survey_compression}. Parameter pruning, low-rank factorization and weight quantization are some of the methods proposed to compress deep networks. However, these methods may require dedicated hardware or software customization for practical implementation. A promising method to obtain compact models that are easy to deploy is knowledge distillation (KD) \cite{kd}. In KD, the student model (a memory-efficient network) learns from a powerful teacher model (a state-of-the-art network) to recover the accuracy that the student loses due to parameter reduction. In computer vision, KD has been widely developed for image classification tasks \cite{fitnet,kd_assistant}. Recently, some works have focused on applying KD to image segmentation \cite{kd_segmentation} and object detection \cite{regression_object_kd} tasks. These works can be adapted to the MRI analysis stage. In our work, we propose to compress the deep learning models in the reconstruction and restoration stages through our novel KD framework. Thereby, the entire MRI workflow can be implemented with efficient storage and computation and with significant speed-up. We primarily use MRI reconstruction to demonstrate the effectiveness of our proposed framework. We also extend our framework to MRI super-resolution and obtain encouraging results, which are presented in Appendix A. %\ref{sect:kd-mri}
In summary, the following are our contributions: 

\begin{itemize}
    \item We propose an end-to-end trainable framework for learning compact MRI reconstruction networks through knowledge distillation (KD). To the best of our knowledge, this is the first application of KD for the MRI reconstruction problem.
    \item For MRI reconstruction and restoration, we propose an attention-based feature distillation method, which helps the student learn the intermediate representations of the teacher. We also propose combining it with imitation-loss-based KD, which acts as a regularizer to the reconstruction loss.
    \item We demonstrate the effectiveness of our approach using the deep cascade of convolutional neural networks (DC-CNN). We use DC-CNN with five cascades and three convolution layers per cascade as the student (S-DC-CNN), and DC-CNN with five cascades and five convolution layers per cascade as the teacher (T-DC-CNN). We perform extensive experiments using Cardiac, Brain, and Knee MRI datasets for 4x, 5x, and 8x accelerations. We show that S-DC-CNN trained using our KD method consistently improves PSNR and SSIM over S-DC-CNN trained without the assistance of the teacher across all datasets and acceleration factors. For Knee image reconstruction, S-DC-CNN gives $65\%$ parameter reduction, 2x faster CPU running time, and 1.5x faster GPU running time compared to T-DC-CNN.
    \item We compare our attention-based feature distillation method against common feature distillation methods. We observed that our method provides lower validation error and is thus better at transferring the teacher's knowledge to the student. We also conduct an ablation study to understand the significance of our attention-based feature distillation and imitation loss. We found that attention transfer is the key to KD.
    % We observe that 
\end{itemize}

\section{Brief Literature Review}
\subsection{MRI reconstruction}
MRI reconstruction is the process of transforming the acquired Fourier space (k-space) data to the image domain. Since MRI is a slow acquisition modality, only a subset of the k-space samples is acquired (undersampling) to obtain the reconstruction. However, this reconstruction suffers from aliasing artifacts. Currently, deep learning networks are developed to remove these artifacts and provide a reconstruction equivalent to that obtained by sampling the entire (fully sampled) k-space. \cite{isbi_wang} proposed a basic convolutional neural network (CNN) to learn the mapping between the undersampled (US) and fully sampled (FS) images. Later, \cite{residual_conference} introduced residual learning, which showed that learning the aliasing artifacts is easier and better than learning the FS image. \cite{automap} proposed AUTOMAP, a fully connected network that operates on the k-space domain to learn the mapping between US k-space and the FS image. \cite{unet_dc} proposed to use U-Net with a data consistency (DC) block to retain the known frequency components in the predicted FS image. \cite{dc_cnn} introduced DC-CNN, a deep cascade network with each cascade containing CNN and DC blocks. \cite{dc_unet} replaced the CNN in DC-CNN with U-Net.
% Likewise, \cite{miccan} introduced MICCAN, which uses U-Net with channel-wise attention block. \cite{dc-ensemble} introduced DEN, a deep ensemble network which gives dense connections to CNN in DC-CNN. \cite{recursive_dilated} introduced RDN, a recursive dilated network to achieve better performance with lesser network parameters. \cite{kikinet} proposed to predict FS k-space followed by FS image using CNN. \cite{hybrid} proposed a deep cascade network which uses both k-space and image domain CNN.

\subsection{Knowledge distillation}
\cite{kd} introduced the concept of KD in deep neural networks for model compression. In their work, they proposed a student-teacher paradigm where the student, a network with fewer parameters, obtains the knowledge from the teacher by learning the class distributions via the softmax layer. \cite{fitnet} proposed hint training, \cite{flow_kd} introduced the Flow of Solution Procedure (FSP), and \cite{attention_kd} developed an attention mechanism; each of these enables the student to learn the intermediate representations of the teacher. Unlike previous works, \cite{regression_object_kd} proposed a knowledge transfer procedure for regression based on a teacher-bounded loss. Recently, \cite{regression_pose_kd} proposed various ways of blending the loss of the student with respect to the ground truth and the teacher.

\section{Methodology}
\subsection{MRI reconstruction problem formulation}
Let $\textbf{x} \in \mathbb{C}^{N}$ represent a column-stacked vector of a complex-valued MR image with dimension $\sqrt{N} \times \sqrt{N}$. Let $\textbf{y} \in \mathbb{C}^{M}$ ($M < N$) represent the undersampled k-space measurements. By definition, $\textbf{y} = \textbf{F}_{u}\textbf{x}$, where $\textbf{F}_{u} \in \mathbb{C}^{M \times N}$ is the undersampled Fourier encoding matrix. Our problem is to reconstruct $\textbf{x}$ from $\textbf{y}$. This inverse problem is ill-posed, and the zero-filled solution $\textbf{x}_u = \textbf{F}^{H}_{u}\textbf{y}$ is an aliased image due to violation of the Nyquist-Shannon sampling theorem. The deep learning formulation to obtain $\textbf{x}$ is given by:
\begin{align}
\label{eq:1}
 \underset{\textbf{x},\theta}{\min}\quad \|\textbf{x} - f_{cnn}(\textbf{x}_{u}\lvert\theta) \|_{2}^2
+ \lambda\|\textbf{F}_{u}\textbf{x} - \textbf{y} \|^2_{2}
\end{align}
where $f_{cnn}$ is the deep network parameterised by $\theta$, which learns the mapping between $\textbf{x}_u$ and  $\textbf{x}$. To provide data consistency for the network's output, the following data fidelity procedure is followed:
\begin{equation}
\label{eq:2}
\hat{\textbf{x}}_{rec}=
\begin{cases}
  \hat{\textbf{x}}_{cnn}(k)  & \ k\notin\Omega \\
  \frac{\hat{\textbf{x}}_{cnn}(k) + \lambda \hat{\textbf{x}}_{u}(k)}{1+\lambda} & k\in\Omega \\
\end{cases}
\end{equation}
where $\hat{\textbf{x}}_{cnn} = \textbf{F}\textbf{x}_{cnn}$ with $\textbf{x}_{cnn}=f_{cnn}(\textbf{x}_{u}\lvert\theta)$, $\hat{\textbf{x}}_{u} = \textbf{F}\textbf{x}_{u}$, $\textbf{x}_{rec}= \textbf{F}^{-1}\hat{\textbf{x}}_{rec}$, $\textbf{F}$ is the full Fourier encoding matrix, and $\Omega$ is the index set of the k-space locations that have been sampled.
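
As a quick reading of Equation \ref{eq:2} (a worked check of the formula, not an additional modelling step), consider its two limiting cases for a sampled location $k\in\Omega$:
\begin{align*}
\lambda = 0: &\quad \hat{\textbf{x}}_{rec}(k) = \hat{\textbf{x}}_{cnn}(k), \\
\lambda \to \infty: &\quad \hat{\textbf{x}}_{rec}(k) = \lim_{\lambda\to\infty}\frac{\hat{\textbf{x}}_{cnn}(k) + \lambda\,\hat{\textbf{x}}_{u}(k)}{1+\lambda} = \hat{\textbf{x}}_{u}(k).
\end{align*}
With $\lambda = 0$ the network prediction is left untouched, whereas a very large $\lambda$, the usual choice when the acquired samples are treated as noise-free, overwrites every sampled k-space entry with its measured value. Intermediate values of $\lambda$ give a weighted average of the predicted and measured entries.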

\begin{figure}[h]
    \centering
    \includegraphics[width=0.9\linewidth]{images/KD-DC-CNN.png}
    \caption{Teacher DC-CNN: Five cascades with each cascade having five convolution layers. Student DC-CNN: Five cascades with each cascade having three convolution layers. Attention transfer and imitation loss help in teacher-student knowledge transfer. The attention transfer loss is computed between the output of the third convolution layer of each teacher cascade and the output of the second convolution layer of the corresponding student cascade. The imitation loss is computed between the outputs of the Teacher and Student DC-CNN.}
    \label{fig:outline}
\end{figure}

\subsection{Proposed knowledge distillation framework}
KD is the process of transferring the knowledge of a large teacher network to a small student network. The main idea of KD is to achieve parameter reduction without a significant drop in performance. In our work, we design KD methods for MRI reconstruction and demonstrate their efficacy by applying them to the commonly used DC-CNN network.
This choice was made considering the following factors:
1) its simple design involving fully convolutional layers, and
2) its extensibility to other CNN-based MRI reconstruction architectures \citep{hybrid,dc-ensemble,recursive_dilated}.
Figure \ref{fig:outline} depicts the overview of DC-CNN using KD.

\textbf{Deep cascade of convolutional neural networks (DC-CNN)}:
DC-CNN \cite{dc_cnn} consists of $n_{c}$ cascades, each containing $n_{d}$ convolution layers followed by one data consistency (DC) layer. The kernel size of each convolution layer is $3\times3$, with stride and padding set to 1. The first convolution layer takes a 1-channel (real) US image as input and gives 32 feature maps, while the final convolution layer takes 32 feature maps as input and gives a 1-channel output (the predicted FS image). All other convolution layers have 32 input and 32 output feature maps. ReLU is used to introduce non-linearity between convolution layers. The DC layer fills the predicted k-space with the known values to provide consistency in the Fourier domain. Each cascade has a residual connection which sums the output of the cascade with its input.

\textbf{Teacher DC-CNN}:
DC-CNN with five cascades ($n_{c}=5$) and five convolution layers per cascade ($n_{d}=5$) is chosen as the teacher. Let $f_{cnn}^{T}$ parameterised by $\theta^{T}$ be the teacher DC-CNN. Then, the reconstruction $\textbf{x}_{rec}^{T}$ from the teacher is given by: $\textbf{x}_{rec}^{T} = f_{cnn}^{T}(\textbf{x}_{u}\lvert\theta^{T})$.

\textbf{Student DC-CNN}:
DC-CNN with five cascades ($n_{c}=5$) and three convolution layers per cascade ($n_{d}=3$) is chosen as the student. Let $f_{cnn}^{S}$ parameterised by $\theta^{S}$ be the student DC-CNN. Then, the reconstruction $\textbf{x}_{rec}^{S}$ from the student is given by: $\textbf{x}_{rec}^{S} = f_{cnn}^{S}(\textbf{x}_{u}\lvert\theta^{S})$.
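
To make the size difference between the two networks concrete, the learnable parameters per cascade can be counted from the layer description above. The count below is an illustrative sketch: it assumes that every $3\times3$ convolution carries a bias term and that the DC layer has no learnable parameters, neither of which is stated explicitly in the text.
\begin{align*}
P(n_{d}) &= \underbrace{(3\cdot3\cdot1\cdot32 + 32)}_{\text{first layer}} + (n_{d}-2)\underbrace{(3\cdot3\cdot32\cdot32 + 32)}_{\text{middle layers}} + \underbrace{(3\cdot3\cdot32\cdot1 + 1)}_{\text{last layer}} \\
         &= 609 + 9248\,(n_{d}-2), \\
P_{T} &= n_{c}\,P(5) = 5 \times 28353 = 141765, \\
P_{S} &= n_{c}\,P(3) = 5 \times 9857 = 49285, \\
1 - P_{S}/P_{T} &\approx 0.652.
\end{align*}
Under these assumptions the student has roughly $65\%$ fewer parameters than the teacher, which is in line with the parameter reduction reported for S-DC-CNN.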

\subsubsection{Attention-based feature distillation}

\cite{attention_kd} used attention maps as a feature distiller and showed improvement in the classification performance of student networks. For classification tasks, the attention maps provide the significance of each activation in the feature map with respect to the input. However, in the case of image-to-image regression problems, the attention maps provide an intermediate image reflecting the final reconstructed output. Hence, they provide the most direct form of teacher supervision for MRI reconstruction. Thus, the goal is to make the student network mimic the attention maps of the teacher network, allowing it to learn better intermediate representations.

Let the feature maps after activation be denoted as $A \in \mathbb{R}^{C\times H\times W}$, where $C$ is the number of channels and $H \times W$ is the spatial dimension. The attention map of the features is given by $F_{sum}(A) = \sum_{i=1}^{C}|A_{i}|^{2}$. To obtain effective information distillation, we adapt the following attention transfer loss:
\begin{equation}
  \label{eq:3}
    L_{AT} = \sum_{j \in I}\left\|\frac{Q_{S}^{j}}{\|Q_{S}^{j}\|_{2}} - \frac{Q_{T}^{j}}{\|Q_{T}^{j}\|_{2}}\right\|_{2}
\end{equation}
where $Q_{S}^{j} = \mathrm{vec}(F_{sum}(A_{S}^{j}))$ and $Q_{T}^{j} = \mathrm{vec}(F_{sum}(A_{T}^{j}))$ represent the $j$-th pair of student and teacher attention maps in vectorized form, and $I$ denotes the set of teacher-student convolution layer pairs selected for attention transfer. In our case, the convolution layers at the center of each cascade of the teacher and student DC-CNN form the set $I$. This choice of distillation position was made after trying different combinations, which are reported in Appendix B.
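
The following short derivation, added here as an interpretive aid, shows what each term of Equation \ref{eq:3} measures. The attention map $F_{sum}(A)$ has size $H\times W$, so each vectorized map has $HW$ entries regardless of the number of channels. Writing $\textbf{u} = Q_{S}^{j}/\|Q_{S}^{j}\|_{2}$ and $\textbf{v} = Q_{T}^{j}/\|Q_{T}^{j}\|_{2}$ (both assumed non-zero), and letting $\phi_{j}$ be the angle between $Q_{S}^{j}$ and $Q_{T}^{j}$,
\begin{align*}
\|\textbf{u} - \textbf{v}\|_{2}^{2} &= \|\textbf{u}\|_{2}^{2} + \|\textbf{v}\|_{2}^{2} - 2\,\textbf{u}^{\top}\textbf{v} = 2 - 2\cos\phi_{j}, \\
\|\textbf{u} - \textbf{v}\|_{2} &= \sqrt{2 - 2\cos\phi_{j}}.
\end{align*}
Since the entries of an attention map are sums of squares and hence non-negative, $\cos\phi_{j}\in[0,1]$ and each term lies in $[0,\sqrt{2}]$. A term vanishes exactly when the student map equals the teacher map up to a positive scale factor, so the loss matches the spatial pattern of the attention maps rather than their overall magnitude.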
% \ref{sect:choice_of_feature_distillation_position}. 

\subsubsection{Imitation loss}
\cite{regression_pose_kd} proposed to use the imitation loss as an additional constraint along with the student loss and showed performance improvement in student networks. We incorporate this loss in our MRI reconstruction problem, as it can act as a regularizer to the student reconstruction loss. As this constraint is enforced along with the regular reconstruction loss, there is no additional overhead in terms of training time, unlike attention transfer. We propose the following total reconstruction loss for the student:
\begin{equation}
\label{eq:4}
L_{total}^{S} = \alpha L^{S}_{rec} + (1 - \alpha) L_{imit}
\end{equation}
where $L^{S}_{rec} = || \textbf{x} - \textbf{x}_{rec}^{S} ||$ is the loss between the student prediction and the target, $L_{imit} = || \textbf{x}_{rec}^{T} - \textbf{x}_{rec}^{S} ||$ is the imitation loss between the teacher and student predictions, and $\alpha$ balances the two terms.
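
The norm in Equation \ref{eq:4} is left unspecified. Assuming, purely for illustration, that both terms use the squared $\ell_{2}$ norm, the regularizing role of the imitation term can be made explicit by completing the square:
\begin{align*}
\alpha\,\|\textbf{x} - \textbf{x}_{rec}^{S}\|_{2}^{2} + (1-\alpha)\,\|\textbf{x}_{rec}^{T} - \textbf{x}_{rec}^{S}\|_{2}^{2}
= \big\|\textbf{x}_{rec}^{S} - \big(\alpha\,\textbf{x} + (1-\alpha)\,\textbf{x}_{rec}^{T}\big)\big\|_{2}^{2} + \alpha(1-\alpha)\,\|\textbf{x} - \textbf{x}_{rec}^{T}\|_{2}^{2}.
\end{align*}
The last term does not depend on the student, so under this assumption minimizing the total loss is equivalent to regressing the student output onto a blend of the target and the teacher prediction. For $\alpha = 0.5$ this blended target is the midpoint of the two. For other norms the identity does not hold exactly, but the same intuition applies.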
\subsubsection{Training procedure}
\begin{algorithm2e}[H]
\SetAlgoLined
\begin{itemize}
    \item Step 1: Train the weights $\theta^{T}$ of the teacher DC-CNN $f^{T}_{cnn}$ using the teacher reconstruction loss $L^{T}_{rec} = || \textbf{x} - \textbf{x}_{rec}^{T} || $\;
    \item Step 2: Train the weights $\theta^{S}$ of the student DC-CNN $f^{S}_{cnn}$ using the attention transfer loss $L_{AT}$ (Equation \ref{eq:3}) between teacher and student\;
    \item Step 3: Load the weights $\theta^{S}$ from Step 2 and re-train $f^{S}_{cnn}$ using the combined student reconstruction and imitation loss $L_{total}^{S} = \alpha || \textbf{x} - \textbf{x}_{rec}^{S} ||  + (1 - \alpha)|| \textbf{x}_{rec}^{T} - \textbf{x}_{rec}^{S} || $ (Equation \ref{eq:4})\;
\end{itemize}
\caption{Knowledge transfer procedure}
\label{algo:1}
\end{algorithm2e}
\section{Experiments and Results}
\subsection{Dataset Description, Evaluation metrics and Implementation details}
\textbf{Dataset Description:} 1) \textbf{Cardiac MRI dataset}: The Automated Cardiac Diagnosis Challenge (ACDC) dataset \cite{acdc_dataset} consists of 150 and 50 patient records for training and validation, respectively. We extracted the 2D slices and cropped them to 150$\times$150, which yields 1841 training and 1076 validation slices. 2) \textbf{Brain MRI dataset}: The MRBrainS dataset \cite{mrbrains_dataset} contains T1, T1-IR and T2-FLAIR volumes for 7 subjects. We use the T1 volumes, with slices of size 240$\times$240. Five subjects (240 slices) are used for training and two subjects (96 slices) for validation. 3) \textbf{Knee MRI dataset}: The dataset used by \cite{knee_dataset} has coronal proton density knee volumes for 20 subjects acquired using a 15-element knee coil. Each slice in the volume, with dimension $640\times368$, has 15 channels and their respective sensitivity maps. The multi-channel slices are converted to single channel through root sum of squares. Ten subjects (200 slices) are used for training and the other ten (200 slices) for validation. US k-space data and US images are retrospectively obtained using fixed Cartesian undersampling masks for 4x, 5x and 8x acceleration factors. The masks always sample the ten lowest spatial frequencies for the ACDC and MRBrainS datasets and the thirty lowest spatial frequencies for the knee MRI dataset. The remaining frequencies are sampled following a zero-mean Gaussian distribution. The undersampling masks can be found in Appendix C.
% \ref{sect:us_masks}. 

\textbf{Evaluation metrics}: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are used to evaluate the reconstruction quality. The Wilcoxon signed-rank test with a significance level of 0.05 is used to assess statistical significance.
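
For reference, the standard definitions of the two metrics are sketched below. The peak value $\mathrm{MAX}$, the SSIM constants $c_{1}, c_{2}$ and the SSIM window settings used in the experiments are not stated in this section, so the symbols here are generic placeholders rather than the exact evaluation settings.
\begin{align*}
\mathrm{MSE} &= \frac{1}{N}\,\|\textbf{x} - \textbf{x}_{rec}\|_{2}^{2}, \qquad
\mathrm{PSNR} = 10\log_{10}\!\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right) \text{ dB}, \\
\mathrm{SSIM}(a,b) &= \frac{(2\mu_{a}\mu_{b} + c_{1})(2\sigma_{ab} + c_{2})}{(\mu_{a}^{2} + \mu_{b}^{2} + c_{1})(\sigma_{a}^{2} + \sigma_{b}^{2} + c_{2})},
\end{align*}
where $\mu_{a}, \mu_{b}$ are the means, $\sigma_{a}^{2}, \sigma_{b}^{2}$ the variances and $\sigma_{ab}$ the covariance of the two image patches $a$ and $b$ being compared. SSIM is typically computed over local windows and averaged over the image. Higher values indicate better reconstructions for both metrics.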

% \subsection{Implementation Details}
\textbf{Implementation Details}:
Models are implemented in PyTorch 0.4.0\footnote{Code available at \href{https://github.com/Bala93/KD-MRI}{https://github.com/Bala93/KD-MRI}}. The weight $\alpha$ in Equation \ref{eq:4} is set empirically to 0.5. For every step mentioned in Algorithm \ref{algo:1}, models are trained for 150 epochs using the Adam optimizer with a learning rate of 1e-4.

\subsection{Results and discussion}
\begin{table}[]
\tiny
\centering
\caption{Quantitative comparison between zero-filled (ZF) US images, Teacher (T-DC-CNN), Student (S-DC-CNN) and our proposed model (S-KD-DC-CNN) in terms of PSNR (dB) and SSIM for the ACDC, MRBrainS and Knee MRI datasets. Red indicates the best and blue the second-best performance.}
\label{tab:result1}
\begin{tabular}{|l|l|c|c|c|c|c|c|}
\hline
\multirow{2}{*}{Dataset} & \multirow{2}{*}{Method} & \multicolumn{2}{c|}{4x} & \multicolumn{2}{c|}{5x} & \multicolumn{2}{c|}{8x} \\ \cline{3-8}
 & & PSNR & SSIM & PSNR & SSIM & PSNR & SSIM \\ \hline
                                         & ZF      & 24.27 $\pm$ 3.10                        & 0.6996 $\pm$ 0.08                        & 23.82 $\pm$ 3.11                        & 0.6742 $\pm$ 0.08                        & 22.83 $\pm$ 3.11                        & 0.6344 $\pm$ 0.09                        \\ \cline{2-8} 
                                         & Teacher & {\color[HTML]{FE0000} 32.51 $\pm$ 3.23} & {\color[HTML]{FE0000} 0.9157 $\pm$ 0.04} & {\color[HTML]{FE0000} 31.49 $\pm$ 3.32} & {\color[HTML]{FE0000} 0.9002 $\pm$ 0.04} & {\color[HTML]{FE0000} 28.43 $\pm$ 3.13} & {\color[HTML]{FE0000} 0.8335 $\pm$ 0.06} \\ \cline{2-8} 
                                         & Student & 31.92 $\pm$ 3.17                        & 0.9053 $\pm$ 0.04                        & 30.79 $\pm$ 3.24                        & 0.8863 $\pm$ 0.05                        & 27.87 $\pm$ 3.11                        & 0.8156 $\pm$ 0.07                        \\ \cline{2-8} 
\multirow{-4}{*}{Cardiac}                & Ours    & {\color[HTML]{3166FF} 32.07 $\pm$ 3.21} & {\color[HTML]{3166FF} 0.9084 $\pm$ 0.04} & {\color[HTML]{3166FF} 31.01 $\pm$ 3.27} & {\color[HTML]{3166FF} 0.8913 $\pm$ 0.04} & {\color[HTML]{3166FF} 28.11 $\pm$ 3.17} & {\color[HTML]{3166FF} 0.8236 $\pm$ 0.07} \\ \hline
                                         & ZF      & 31.38 $\pm$ 1.02                        & 0.6651 $\pm$ 0.02                        & 29.93 $\pm$ 0.80                        & 0.6304 $\pm$ 0.02                        & 29.37 $\pm$ 0.98                        & 0.6065 $\pm$ 0.03                        \\ \cline{2-8} 
                                         & Teacher & {\color[HTML]{FE0000} 40.03 $\pm$ 2.00} & {\color[HTML]{FE0000} 0.9781 $\pm$ 0.00} & {\color[HTML]{FE0000} 39.03 $\pm$ 1.28} & {\color[HTML]{FE0000} 0.971 $\pm$ 0.00}  & {\color[HTML]{FE0000} 35.04 $\pm$ 1.38} & {\color[HTML]{FE0000} 0.9374 $\pm$ 0.01} \\ \cline{2-8} 
                                         & Student & 39.36 $\pm$ 1.82                        & 0.9753 $\pm$ 0.00                        & 38.58 $\pm$ 1.28                        & 0.9674 $\pm$ 0.00                        & 34.39 $\pm$ 1.26                        & 0.9281 $\pm$ 0.01                        \\ \cline{2-8} 
\multirow{-4}{*}{Brain}                  & Ours    & {\color[HTML]{3166FF} 39.8 $\pm$ 1.89}  & {\color[HTML]{3166FF} 0.977 $\pm$ 0.00}  & {\color[HTML]{3166FF} 38.78 $\pm$ 1.24} & {\color[HTML]{3166FF} 0.9688 $\pm$ 0.00} & {\color[HTML]{3166FF} 34.83 $\pm$ 1.35} & {\color[HTML]{3166FF} 0.9337 $\pm$ 0.01} \\ \hline
                                         & ZF      & 29.66 $\pm$ 3.86                        & 0.8066 $\pm$ 0.08                        & 29.2 $\pm$ 3.87                         & 0.8007 $\pm$ 0.08                        & 28.71 $\pm$ 3.88                        & 0.7985 $\pm$ 0.08                        \\ \cline{2-8} 
                                         & Teacher & {\color[HTML]{FE0000} 37.15 $\pm$ 3.55} & {\color[HTML]{FE0000} 0.9436 $\pm$ 0.03} & {\color[HTML]{FE0000} 35.16 $\pm$ 3.46} & {\color[HTML]{FE0000} 0.9231 $\pm$ 0.03} & {\color[HTML]{FE0000} 32.53 $\pm$ 3.49} & {\color[HTML]{FE0000} 0.8887 $\pm$ 0.05} \\ \cline{2-8} 
                                         & Student & 36.37 $\pm$ 3.53                        & 0.9367 $\pm$ 0.03                        & 34.37 $\pm$ 3.47                        & 0.9144 $\pm$ 0.04                        & 31.92 $\pm$ 3.58                        & 0.8804 $\pm$ 0.05                        \\ \cline{2-8} 
\multirow{-4}{*}{Knee}                   & Ours    & {\color[HTML]{3166FF} 36.7 $\pm$ 3.52}  & {\color[HTML]{3166FF} 0.9392 $\pm$ 0.03} & {\color[HTML]{3166FF} 34.71 $\pm$ 3.44} & {\color[HTML]{3166FF} 0.9181 $\pm$ 0.04} & {\color[HTML]{3166FF} 32.32 $\pm$ 3.57} & {\color[HTML]{3166FF} 0.8867 $\pm$ 0.05} \\ \cline{2-8} \hline
\end{tabular}
\end{table}
\subsubsection{Quantitative and qualitative comparison}
We compare S-KD-DC-CNN (student DC-CNN trained using our KD procedure) with S-DC-CNN (student DC-CNN trained without the assistance of the teacher) and T-DC-CNN (teacher DC-CNN) on the cardiac, brain and knee datasets at 4x, 5x and 8x accelerations. Table \ref{tab:result1} provides the quantitative comparison of these methods. S-KD-DC-CNN performs better than S-DC-CNN in every setting, and the differences between S-KD-DC-CNN and S-DC-CNN in PSNR and SSIM are statistically significant ($p < 0.05$). Distillation thus narrows the performance gap between teacher and student. A qualitative comparison of the methods is shown in Figure \ref{fig:result1}. The reconstructions obtained using S-KD-DC-CNN are visibly closer to those of T-DC-CNN, while S-DC-CNN shows relatively higher information loss. We attribute the performance of S-KD-DC-CNN to the combination of Attention Transfer (AT) and imitation loss, which helps in reconstructing fine structures. The imitation loss is expected to act as a regularizer on the student reconstruction loss, while AT helps the student learn the intermediate representations of the teacher. To verify this, we computed the residue between the attention maps of T-DC-CNN and S-DC-CNN and compared it with the residue between the attention maps of T-DC-CNN and S-KD-DC-CNN. We found that the attention maps of S-KD-DC-CNN are closer to those of T-DC-CNN than the attention maps of S-DC-CNN are. Learning these representations therefore provides pre-trained weights that lead to better-optimized training. Qualitative and quantitative comparisons of the residues for each cascade, along with our observations, are reported in Appendix D.
% \ref{sect:feature_residue}.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{images/KDresults.png}
    \caption{From left to right: Zero-filled, Target, Teacher (T-DC-CNN), Student (S-DC-CNN), Ours (S-KD-DC-CNN), Teacher Residue, Student Residue, KD Residue. From top to bottom: MRBrainS, ACDC, Knee MRI. All images are displayed for an acceleration factor of 5x. In addition to having lower reconstruction error, the distilled model retains finer structures better than the student.}
    \label{fig:result1}
\end{figure}

\subsubsection{Parameter count and running time}    
We measure the parameter count and running time of T-DC-CNN and S-DC-CNN for Knee 4x acceleration to understand the effect of model compression. T-DC-CNN and S-DC-CNN have 141K and 49K parameters, respectively. The CPU running time for a single image reconstruction is 568 ms for T-DC-CNN and 294 ms for S-DC-CNN; the corresponding GPU running times are 24 ms and 16 ms. Since S-KD-DC-CNN has the same number of parameters as S-DC-CNN, we can state the following: 1) S-KD-DC-CNN has 65\% fewer parameters than T-DC-CNN, 2) S-KD-DC-CNN runs nearly 2 times faster than T-DC-CNN on CPU, and 3) S-KD-DC-CNN runs 1.5 times faster than T-DC-CNN on GPU.
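
As a quick check, these ratios follow directly from the measurements reported above. The symbols $P$, $t^{\mathrm{CPU}}$ and $t^{\mathrm{GPU}}$, with subscripts $T$ (teacher) and $S$ (student), are introduced here only for this calculation:
\begin{align*}
1 - \frac{P_{S}}{P_{T}} &= 1 - \frac{49\text{K}}{141\text{K}} \approx 0.652, \\
\frac{t^{\mathrm{CPU}}_{T}}{t^{\mathrm{CPU}}_{S}} &= \frac{568~\text{ms}}{294~\text{ms}} \approx 1.93, \\
\frac{t^{\mathrm{GPU}}_{T}}{t^{\mathrm{GPU}}_{S}} &= \frac{24~\text{ms}}{16~\text{ms}} = 1.5.
\end{align*}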

\begin{figure}
    \centering
    \includegraphics[width=0.9\linewidth]{images/KDAblativestudy.png}
    \caption{Left (a): Reconstruction loss of various feature distillation methods on the validation set: T-DC-CNN (Teacher), S-DC-CNN (Student), S-FN-DC-CNN (FN), S-FSP-DC-CNN (FSP), S-SP-DC-CNN (SP), S-AH-DC-CNN (AH) and S-AT-DC-CNN (Ours). Right (b): Ablation study of attention transfer and imitation loss functions.}
    \label{fig:result2}
\end{figure}

\subsubsection{Comparison of feature distillation methods}
We compare our Attention Transfer (AT) method with other feature distillation methods, namely FitNets (FN) \cite{fitnet}, Flow of Solution Procedure (FSP) \cite{flow_kd}, Similarity Preserving KD (SP) \cite{tung2019similarity} and Attentive Hint (AH) \cite{regression_pose_kd}, for cardiac 8x acceleration. In this experiment, the student network is first pre-trained with each feature distillation method, and the resulting weights are used to initialize the fine-tuning stage. During fine-tuning, we consider only the student reconstruction loss and ignore the imitation loss (by setting $\alpha=1$ in Eq. \ref{eq:4}). In FN, the student is expected to learn the entire feature map of the teacher. In FSP, the student is tasked with mimicking the teacher in terms of the flow between the feature maps of the first and the penultimate layer. In SP, the student learns to mimic the similarity map of the intermediate layers of the teacher. In AH, teacher supervision is weighted according to the reconstruction quality of the teacher. In AT, the student is expected to produce a sum of its feature maps that is similar to that of the teacher. Figure \ref{fig:result2}a depicts the validation loss for the networks T-DC-CNN, S-DC-CNN, S-FN-DC-CNN, S-FSP-DC-CNN, S-SP-DC-CNN, S-AH-DC-CNN and S-AT-DC-CNN. The validation loss obtained with S-AT-DC-CNN is lower than that of the other distillation methods and is thus closest to the teacher loss. This empirically demonstrates that, among the methods compared, AT is better at transferring the knowledge of the teacher to the student. The quantitative comparison of AT with the other feature distillation methods can be found in Appendix E.
% \ref{sect:feature_distillation_quantitative_comparison}. 
% This behaviour can be attributed to 

\subsubsection{Ablative study of attention transfer and imitation loss}
We conduct an ablative study to understand the effect of attention transfer and imitation loss for cardiac 8x acceleration. Figure \ref{fig:result2}b presents the validation error of S-DC-CNN trained using different combinations of loss functions. The validation error obtained with ($L_{rec}^{S}$ and $L_{imit}$) and with ($L_{rec}^{S}$ and $L_{AT}$) is lower than that obtained with $L_{rec}^{S}$ alone, which shows that both $L_{AT}$ and $L_{imit}$ contribute to lowering the validation error. Two further observations can be made from the graph: 1) using $L_{rec}^{S}$, $L_{imit}$ and $L_{AT}$ together gives a validation error almost equal to that of $L_{rec}^{S}$ and $L_{AT}$, and 2) the validation error of S-DC-CNN trained using $L_{rec}^{S}$ and $L_{AT}$ is lower than that of the network trained using $L_{rec}^{S}$ and $L_{imit}$.
This indicates that attention transfer is the key component of effective knowledge distillation in this setting.

\section{Conclusion}
We proposed a knowledge distillation (KD) framework for image-to-image problems in the MRI workflow in order to develop compact, low-parameter models without a significant drop in performance. The framework obtains teacher supervision through a combination of attention transfer and imitation loss. We demonstrated its efficacy on the DC-CNN network and showed consistent improvements in student reconstruction across datasets and acceleration factors.

\bibliography{murugesan20}

\end{document}
