\documentclass{midl} % Include author names
%\documentclass[anon]{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution
\usepackage{multirow}
\usepackage{mwe} % to get dummy images
\jmlrvolume{-- Under Review}
\jmlryear{2021}
\jmlrworkshop{Full Paper -- MIDL 2021}
\editors{Under Review for MIDL 2021}

\title[Distill DSM]{Distill DSM: Computationally efficient method for segmentation of medical imaging volumes}


% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Harsh Maheshwari\nametag{$^{1}$}} \Email{harshmaheshwari135@gmail.com}\\
\Name{Vidit Goel\nametag{$^{1}$}} \Email{gvidit98@gmail.com}\\
\Name{Ramanathan Sethuraman\nametag{$^{2}$}} \Email{ramanathan.sethuraman@intel.com}\\
\Name{Debdoot Sheet\nametag{$^{1}$}} \Email{debdoot@ee.iitkgp.ac.in}\\
% \Name{Author Name4\midljointauthortext{Contributed equally}\nametag{$^{3}$}} \Email{uvw@foo.ac.uk}\\
% \addr $^{3}$ Address 3 \AND
% \Name{Author Name5\midlotherjointauthor\nametag{$^{4}$}} \Email{fgh@bar.com}\\
\addr $^{1}$ Indian Institute of Technology, Kharagpur \\
\addr $^{2}$ Intel Technology India Pvt. Ltd, Bangalore
}

\begin{document}

\maketitle

\begin{abstract}
Accurate segmentation of volumetric scans such as MRI and CT is in high demand in clinical practice for surgery planning, quantitative analysis, and identification of disease. However, accurate segmentation is challenging because of the irregular shape of the target organ and the large variation in appearance across slices. Such problems call for 3D features, which can be extracted using a 3D convolutional neural network (CNN). However, a 3D CNN is compute and memory intensive because of its large number of parameters and can easily overfit, especially in medical imaging where training data is limited. To address these problems, we propose a distillation-based depth shift module (Distill DSM). It is designed to enable 2D convolutions to use information from neighbouring slices more efficiently. Specifically, in each layer of the network, Distill DSM learns to extract information from a part of the channels and shares it with neighbouring slices, thus facilitating information exchange among neighbouring slices. The module can be incorporated into any 2D CNN model so that it uses information across slices while introducing very few extra learnable parameters. We evaluate our model on the BRATS 2020, heart, hippocampus, pancreas and prostate datasets. Our model achieves better performance than a 3D CNN on the heart and prostate datasets and comparable performance on the BRATS 2020, pancreas and hippocampus datasets with only 28\% of the parameters of the 3D CNN model.

% We have evaluated our model on brats2020, heart, hippocampus, pancreas and prostate dataset. Our method achieved state of the art performance for heart and prostate datasets and comparable performance on brats2020, pancreas and hippocampus dataset.



% The proposed method achieves the state of the art performance in heart and prostate datasets and less than 1\% drop in dice score on brats and  hippocampus dataset.

\end{abstract}

\begin{keywords}
Deep learning, volumetric segmentation, parameter efficient 3D CNN, distillation, channel shifting 
\end{keywords}

\section{Introduction}
Medical imaging using Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) is frequently used in clinical practice for a wide range of purposes, e.g., injury prediction, disease diagnosis, surgery simulation, and therapeutic planning. Interpreting or analysing the clinical condition often requires segmenting the region of interest in a given CT or MRI scan.  \par
Manual segmentation of medical images is tedious, labour-intensive work and often leads to high variation across annotators, which motivates automating the segmentation process. With advances in deep learning methods, convolutional neural networks (CNNs) are increasingly applied to medical image segmentation tasks to increase consistency across multiple human experts.  \par
Medical images such as CT and MRI are 3D in nature and widely used in clinical diagnosis. Two strategies are commonly used to segment volumetric data. The first treats the 3D volume as a set of individual 2D slices and trains a 2D CNN to segment the structures of interest in each slice. The second processes the volume directly with a 3D CNN trained for volumetric segmentation. A 2D CNN gives a computationally light model with fast inference. However, it does not take information from adjacent slices into account, which lowers segmentation accuracy. A 3D CNN, on the other hand, incorporates information from adjacent slices for better segmentation quality and has the same spatial field of view as a 2D CNN, but its higher computational cost results in lower throughput and higher latency. Because of their large number of parameters, 3D CNNs are also prone to overfitting, especially on small datasets.  \par
To bridge the performance gap between 2D and 3D CNNs, we propose a simple and computationally efficient technique whose computational complexity is of the order of a 2D CNN, while still incorporating inter-slice information for better segmentation quality. In this paper, we introduce a novel component termed Distill DSM, which effectively models information along the depth dimension. It is motivated by TSM\footnote{Originally the Temporal Shift Module (TSM); since volumetric data is used here, it is referred to as the Depth Shift Module (DSM) in the rest of the paper.} \cite{tsm}, originally proposed for action recognition in videos. 
% Though inspired by DSM, we found that DSM is not perfectly suitable for semantic segmentation as it losses spatial and semantic information while shifting channel, which makes us propose Distill DSM. 
The proposed module can be inserted into any 2D segmentation network to improve its performance with only a small increase in the number of parameters and in computation.
In each layer of the network where it is present, Distill DSM learns to extract the information that is useful for the current slice and the information that is useful for its immediate neighbours, thereby mitigating the loss of information. Distill DSM achieves performance comparable to a state-of-the-art 3D CNN model on the BRATS 2020\footnote{https://www.med.upenn.edu/cbica/brats2020/data.html}, pancreas\footnote{\label{msdfoot}http://medicaldecathlon.com/} and hippocampus\textsuperscript{\ref{msdfoot}} datasets, and better results than the 3D CNN on the heart\textsuperscript{\ref{msdfoot}} and prostate\textsuperscript{\ref{msdfoot}} datasets, with just \textbf{28\%} of the parameters of the 3D CNN. This paper makes the following contributions:
\begin{itemize}
    \item We propose a distillation-based depth shift module that enables segmentation of volumetric data with 2D convolutions by extracting the necessary information and sharing it with neighbouring slices along the depth dimension, reducing the model size to 28\% of the state-of-the-art 3D CNN.
    \item The proposed solution is a plug-and-play module that can be incorporated into any 2D CNN architecture to model information along the Z direction.
    \item We carry out a comprehensive evaluation on five datasets to validate the proposed method.
\end{itemize}
\section{Related Work}

\subsection{Segmentation}
Earlier approaches to image segmentation use classical image processing techniques such as thresholding and region-growing methods \cite{87344}. Recent approaches make use of machine learning techniques. Segmentation of 2D medical images using deep neural networks now reaches accuracy close to human performance \cite{book, doi:10.1146/annurev-bioeng-071516-044442, FMBCAMBBR19, ronneberger2015unet, nandamuri2019sumnet}. Initial approaches to segmenting 3D volumes used 2D CNNs to segment 2D slices individually \cite{milletari2016houghcnn}. Although computationally inexpensive, this approach has limited accuracy. More recent fully convolutional architectures such as 3D U-Net \cite{3DUNET} and V-Net \cite{milletari2016vnet} employ 3D convolutions, which results in high performance but at high computational cost.
\subsection{Efficient Neural Networks for Learning 3D Features}
Efficient neural networks commonly combine a 2D CNN with additional techniques to learn 3D features in a computationally inexpensive manner. Approaches for learning 3D features with 2D CNNs mostly fall into three streams: 1) 2D slice distillation \cite{chen2016combining,Christ_2016,cai2017improving,Novikov_2019}, 2) 2.5D \cite{inproceedingsPrasoon, AMBELLAN2019109,li2018hdenseunet,xia2018bridging,yu2019thickened} and 3) 2D multiple views \cite{Wang_2019,Li_2019}. Methods based on 2D slice distillation distill 3D features from the 2D features learned by 2D CNNs on 2D slices, using a recurrent neural network \cite{rnnbook,lipton2015critical} or a conditional random field \cite{NIPS2004_0c215f19, Zheng_2015}. 2.5D methods learn 3D features by giving several 2D slices as input to a 2D CNN. 2D multiple-view methods extract information from multiple views (usually axial, coronal, and sagittal) and combine it to predict the output. Another common way to make a CNN efficient is to use binary kernels \cite{Heinrich2018OBELISKO, rastegari2016xnornet, juefeixu2017local}, which reduces the number of parameters drastically. \par

Depth Shift Module \cite{tsm} shifts part of the feature channels of each slice to its neighbouring slices so that 2D convolutions can handle depth information. Building on this idea of integrating depth information into 2D convolutions, we propose Distill DSM. 

\section{Methodology}
\subsection{Problem Formulation}
In the segmentation task for 3D image data, let $X_i$ and $Y_i$ denote the $i^{th}$ input image volume and its segmentation map respectively, where $X_i = \{x_1, x_2,\ldots,x_{N_{i}} \}$ and $Y_i = \{y_1, y_2,\ldots,y_{N_{i}} \}$. Here $x_j \in \mathbb{R}^{H\times W}$ is a 2D slice of the medical image and $y_j$ is the segmentation mask for the corresponding 2D slice. Different volumes have different numbers of 2D slices, i.e. different $N_i$. Our objective is to find a function $F$ that minimises the objective function $J$ in Equation \ref{eq:eq1}.
\begin{figure}
    \centering
    \includegraphics[height = 5cm, width=12cm]{DSM.png}
    \caption{First row contains the intermediate convolutional features of the slices $\mathbf{Z_{i}}$, $\mathbf{Z_{i+1}}$, $\mathbf{Z_{i+2}}$, $\mathbf{Z_{i+3}}$. Some parts of the channels are shifted to neighbouring slices to exchange information. The second row contains convolutional features after shifting is done. The channel indicated by white color represents zero padding.}
    \label{DSM}
\end{figure} 

\begin{equation}
    J = \dfrac{1}{K} \sum_{i=1}^{K} L(F(X_i), Y_i)
    \label{eq:eq1}
\end{equation}

\noindent where $K$ is the total number of 3D volumes in the training dataset and $L$ is the loss function, computed from the model output and the ground truth.

\subsection{Intuition}
3D CNNs capture inter-slice information by convolving a 3D kernel with the 3D input, which gathers information from neighbouring slices into the current slice. This operation gathers the complete information, both spatial and semantic, from the neighbouring slices. Using all of this information can be redundant in many cases, and it makes 3D CNNs expensive in both computation and memory. \par
A simpler and more efficient way to exchange information among neighbouring slices is to shift some channels of the current slice to its neighbouring slices, as shown in Figure \ref{DSM} and proposed in \cite{tsm}. Several works \cite{bau2017network, bau2020units} show that different channels correspond to different semantics. 
%Hence, this method would help in gathering semantic information from the neighbouring slices at a negligible computation cost. 
However, hard shifting of channels leads to the loss of some information from the current slice, both semantic and spatial. To prevent this loss, \cite{tsm} introduced residual DSM, which adds the initial feature back to the channel-shifted feature. However, addition is not an effective way to merge information and still results in the loss of some spatial and semantic information from the current slice. This could considerably degrade the performance of the model, especially for segmentation. If we can retain the necessary information (from the channels being shifted) in the current slice and pass on the information required by the neighbouring slices, the result would be a highly efficient architecture with the benefits of a 3D architecture at the cost of a 2D CNN. Motivated by this, we propose a novel architecture, the Distill Depth Shift Module (Distill DSM).
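\par
As an illustration, the hard shift discussed above can be written out explicitly. The superscripts below are notation introduced only for this sketch and are not taken from \cite{tsm}: $\mathbf{Z}_{i}^{(f)}$ and $\mathbf{Z}_{i}^{(b)}$ denote the channel groups of slice $i$ that are shifted forward and backward, and $\mathbf{Z}_{i}^{(s)}$ denotes the channels that stay in place. Under these assumptions, for a volume with $N$ slices, the shifted feature of slice $i$ would be
\begin{equation*}
    \mathrm{shift}(\mathbf{Z})_{i} = \left[\, \mathbf{Z}_{i-1}^{(f)},\; \mathbf{Z}_{i+1}^{(b)},\; \mathbf{Z}_{i}^{(s)} \,\right], \qquad \mathbf{Z}_{0}^{(f)} = \mathbf{Z}_{N+1}^{(b)} = \mathbf{0},
\end{equation*}
where $[\cdot]$ is concatenation along the channel dimension and the zero terms correspond to the zero padding at the two ends of the volume. The groups $\mathbf{Z}_{i}^{(f)}$ and $\mathbf{Z}_{i}^{(b)}$ no longer appear in slice $i$ after the shift, which is the loss of information from the current slice that a hard shift incurs.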

\begin{figure}[t]
    \centering
    \includegraphics[height = 8cm, width=16cm]{distilled_TSM.png}
    \caption{Distill DSM. From a part of the feature channels of every slice, three kinds of information are extracted: features to retain, features to pass to the forward slice, and features to pass to the backward slice. Channels shown in white in the second row represent zero-padded channels.}
    \label{distill_DSM}
\end{figure} 

\subsection{Distill DSM}
The proposed Distill DSM is shown in Figure \ref{distill_DSM}. We extract three components of information from the part of the feature channels that would be shifted to neighbouring slices in DSM. Consider the feature map of the $i^{th}$ slice, $\mathbf{Z}_{i} \in \mathbb{R}^{C\times h\times w}$, where $C$ is the number of channels and $h \times w$ is the spatial size. We select $\alpha C$ channels from the end of $\mathbf{Z}_{i}$, where $\alpha \in [0, 1]$, and distill the information stored in them into three components: 
1) $\mathbf{R}_{i} \in \mathbb{R}^{\alpha \frac{C}{2}\times h\times w}$, the necessary information to retain in $\mathbf{Z}_{i}$; 2) $\mathbf{F}_{i} \in \mathbb{R}^{\alpha \frac{C}{4}\times h\times w}$, the necessary information to pass to the forward slice $\mathbf{Z}_{i+1}$; and 3) $\mathbf{B}_{i} \in \mathbb{R}^{\alpha \frac{C}{4} \times h\times w}$, the necessary information to pass to the backward slice $\mathbf{Z}_{i-1}$. Each of the distilled components ($\mathbf{R}_{i}, \mathbf{F}_{i}, \mathbf{B}_{i}$) is computed with its own convolution layer. The retained information ($\mathbf{R}_{i}$) from the current slice, the forward information ($\mathbf{F}_{i-1}$) passed on by the previous slice and the backward information ($\mathbf{B}_{i+1}$) passed on by the next slice are then concatenated to the channels of the current slice $\mathbf{Z}_{i}$. These operations are carried out for every slice in the volume. This completes the Distill DSM operation: every slice now holds information from the previous and the following slice along with its own. For the first slice, which has no previous slice, the features are zero padded along the channel dimension to maintain the shape, and a similar adjustment is made for the last slice. The schematic for 4 slices is shown in Figure \ref{distill_DSM}. Note that in the first Distill DSM layer only immediately neighbouring slices share information, but deeper in the network slices that are farther apart also exchange information. 
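\par
The steps above can be summarised in equation form. This is a sketch under stated assumptions rather than the exact implementation: $\mathbf{Z}_{i}^{(d)}$ (the selected $\alpha C$ channels), $\mathbf{Z}_{i}^{(k)}$ (the remaining $(1-\alpha) C$ channels, assumed here to be kept unchanged), the names $g_{R}$, $g_{F}$, $g_{B}$ for the three convolution layers, and the order of concatenation are all introduced only for this illustration. Reading $\mathbf{F}_{i}$ as the features that slice $i$ sends to slice $i+1$ and $\mathbf{B}_{i}$ as those it sends to slice $i-1$, slice $i$ receives $\mathbf{F}_{i-1}$ and $\mathbf{B}_{i+1}$. For a volume with $N$ slices:
\begin{align*}
    \mathbf{R}_{i} &= g_{R}\big(\mathbf{Z}_{i}^{(d)}\big), \qquad \mathbf{F}_{i} = g_{F}\big(\mathbf{Z}_{i}^{(d)}\big), \qquad \mathbf{B}_{i} = g_{B}\big(\mathbf{Z}_{i}^{(d)}\big), \\
    \hat{\mathbf{Z}}_{i} &= \left[\, \mathbf{Z}_{i}^{(k)},\; \mathbf{R}_{i},\; \mathbf{F}_{i-1},\; \mathbf{B}_{i+1} \,\right], \qquad \mathbf{F}_{0} = \mathbf{B}_{N+1} = \mathbf{0}.
\end{align*}
Under these assumptions the number of channels is preserved, since
\begin{equation*}
    (1-\alpha)\, C + \frac{\alpha C}{2} + \frac{\alpha C}{4} + \frac{\alpha C}{4} = C .
\end{equation*}
For example, with a hypothetical layer of $C = 64$ channels and $\alpha = \frac{1}{2}$, the output has $32 + 16 + 8 + 8 = 64$ channels.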


% Then \textit{retainInfo} is concatenated with the feature channels of $Z_{i}$ and \textit{forwardInfo} is concatenated with features of forward slice and similarly \textit{backwardInfo} is concatenated with feature channel of backward slice. 

\par
Exchanging information in this way helps in two respects. 1) Loss of information from the current slice is minimal, as the distillation uses convolutions to retain the necessary information. 2) Information passed to the forward and backward slices is not hard shifted, i.e. the model itself decides what information to pass and to what extent. In semantic segmentation, for example, it may be that the initial layers exchange more information among neighbouring slices to capture structure across slices, so that deeper layers can focus on the semantics of the current slice and hence segment the required region. \par

% \subsection{Network structure}
% We have used encoder decoder based UNet architecture. The proposed Distill DSM is incorporated between the double convolution used in UNet architecture. 

\begin{figure}[t]
    \centering
    \includegraphics[height = 8.5cm, width=13cm]{qual.pdf}
    \vspace{-1cm}
    \caption{Results on the BRATS 2020 dataset. The first row shows a case where 3D U-Net fails to predict correctly whereas Distill DSM predicts correctly. The last row is a boundary slice, where Distill DSM and 3D U-Net are able to predict part of the segmentation map whereas DSM and 2D U-Net are not, indicating that the proposed model captures information along the depth dimension.}
    \label{qual}
\end{figure} 

\section{Experiments}
\subsection{Datasets}
To compare with the baselines \cite{ronneberger2015unet, 3DUNET, tsm}, we train and test our model on the BRATS 2020 \cite{bakas2019identifying} dataset and on 4 representative datasets of the Medical Decathlon challenge \cite{simpson2019large}. The first is Cardiac (heart), which includes 20 mono-modal MR volumes for segmentation of the left atrium. The second is Hippocampus, which includes 263 mono-modal MR volumes for segmentation of the hippocampus head and body. The third is Prostate, which includes 32 multi-modal MR volumes for segmentation of the central gland and peripheral zone. The fourth is Pancreas, which includes 282 CT volumes for segmentation of the pancreas and tumour. Each Medical Decathlon dataset was randomly shuffled and split into 5 fixed folds. The BRATS 2020 dataset consists of 371 training volumes and 127 testing volumes. The training data is further split in a 4:1 ratio for training and validation, and results are reported on the testing dataset. \par

The Dice similarity coefficient and Hausdorff Distance 95 (HD) are used to evaluate the proposed model on the Medical Decathlon datasets, and the Dice similarity coefficient, Hausdorff Distance 95, Sensitivity, and Specificity are used to evaluate it on the BRATS 2020 dataset.  
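\par
For reference, these metrics have the following standard definitions, which are general and not specific to this paper. Here $P$ and $G$ denote the sets of voxels labelled as foreground in the prediction and in the ground truth, and $TP$, $TN$, $FP$ and $FN$ are the voxel-wise true positive, true negative, false positive and false negative counts (symbols introduced only for this illustration):
\begin{equation*}
    \mathrm{Dice} = \frac{2\,|P \cap G|}{|P| + |G|} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad
    \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
    \mathrm{Specificity} = \frac{TN}{TN + FP}.
\end{equation*}
Hausdorff Distance 95 is commonly computed as the 95th percentile, rather than the maximum, of the distances between the boundary points of $P$ and $G$, which makes it less sensitive to isolated outlier voxels; implementations differ in their details.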



\begin{table*}[t]
\centering
\caption{Quantitative segmentation results of 2D U-Net, 3D U-Net, Residual DSM and Distill DSM on BRATS 2020 dataset. ET represents Enhancing Tumor, WT represents Whole Tumor and TC represents Tumor Core}
\resizebox{0.87\textwidth}{!}{\begin{tabular}{cccccc}
\hline
                             & \textbf{Class} & \textbf{2D U-Net} & \textbf{Residual DSM} & \textbf{3D U-Net} & \textbf{Distill DSM(Ours)} \\ \hline
Parameters                   &                & 1,082,211         & 1,082,211             & 4,288,208         & 1,216,266     \\ \hline
FLOPs per voxel                   &                & 38,662         & 38,735             & 58,709         & 39,456     \\ \hline
Wall time per voxel (s)                   &                & 7.9498e-7         & 8.1726e-7             & 8.6517e-7         & 8.2641e-7     \\ \hline
\multirow{3}{*}{Dice}        & ET             & 0.712             & 0.732                 & 0.704             & 0.753         \\ 
                             & WT             & 0.861             & 0.867                 & 0.879             & 0.873         \\  
                             & TC             & 0.687             & 0.704                 & 0.796             & 0.742         \\ \hline
\multirow{3}{*}{Sensitivity} & ET             & 0.714             & 0.707                 & 0.687             & 0.761         \\  
                             & WT             & 0.859             & 0.835                 & 0.898             & 0.841         \\ 
                             & TC             & 0.660             & 0.693                 & 0.779             & 0.726         \\ \hline
\multirow{3}{*}{Specificity} & ET             & 0.9997            & 0.99978               & 0.99975           & 0.99969       \\  
                             & WT             & 0.99903           & 0.99939               & 0.99896           & 0.99944       \\ 
                             & TC             & 0.99975           & 0.99986               & 0.99958           & 0.9997        \\ \hline
\multirow{3}{*}{Hausdorff95} & ET             & 35.20             & 29.21                 & 43.27             & 30.52         \\ 
                             & WT             & 6.52              & 8.42                  & 11.46             & 5.98          \\ 
                             & TC             & 27.39             & 34.85                 & 18.84             & 32.87         \\ \hline
\end{tabular}}
\label{bratsResult}
\end{table*}

\subsection{Implementation details}
Our experiments are implemented in PyTorch on NVIDIA Tesla V100 GPUs (16GB memory) and run on an Ubuntu machine with 96GB RAM and 32 cores. All networks use a per-channel Dice loss function and the Adam optimizer. The proposed Distill DSM is integrated into the U-Net \cite{ronneberger2015unet} architecture. After each convolutional operator in the 2D U-Net, a Distill DSM layer is added, so that the 2D convolution processes each slice individually and its output is then passed through Distill DSM to exchange information among the 2D slices. We use $\alpha = \frac{1}{2}$ for the experiments.


\begin{table}[t]
\centering
\caption{Ablation experiments for the hyperparameter $\alpha$}
\label{ablation}
\resizebox{0.7\textwidth}{!}{\begin{tabular}{ccccc}
\hline
                             & Metric & $\alpha$ = $\frac{1}{4}$ & $\alpha$ = $\frac{1}{2}$  & $\alpha$ = $1$    \\ \hline
Parameter                    &        & 1,115,606    & 1,216,266    & 1,619,330    \\ \hline
Heart                        & Dice 1 & 0.9125$\pm$0.008 & 0.9235$\pm$0.011 & 0.9203$\pm$0.009 \\ \hline
\multirow{2}{*}{Hippocampus} & Dice 1 & 0.8888$\pm$0.006 & 0.8955$\pm$0.005 & 0.8958$\pm$0.003 \\
                             & Dice 2 & 0.8618$\pm$0.006 & 0.8786$\pm$0.008 & 0.8795$\pm$0.002 \\ \hline
\multirow{2}{*}{Prostate}    & Dice 1 & 0.8325$\pm$0.037  & 0.8724$\pm$0.014 & 0.8721$\pm$0.008 \\
                             & Dice 2 & 0.7565$\pm$0.065 & 0.7804$\pm$0.081 & 0.7818$\pm$0.076 \\ \hline
\end{tabular}}
\end{table}

\subsection{Ablation study}
This section determines the value of $\alpha$. As $\alpha$ increases, both the number of parameters and the amount of information sent to the forward and backward slices increase. Experiments were conducted on the Heart, Hippocampus and Prostate datasets of the Medical Decathlon challenge for varying values of $\alpha$, and the results are summarised in Table \ref{ablation}. From $\alpha = \frac{1}{4}$ to $\alpha = \frac{1}{2}$ the mean Dice score improves on every dataset, most clearly on Prostate (Dice 1 rises from 0.8325 to 0.8724). From $\alpha = \frac{1}{2}$ to $\alpha = 1$, however, the mean Dice changes by at most about 0.003, while the number of parameters grows from 1,216,266 to 1,619,330, an increase of about 33\%. Hence, $\alpha = \frac{1}{2}$ is used for all the experiments, as it gives a parameter-efficient and accurate model.
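\par
The parameter counts in Table \ref{ablation} can be related to $\alpha$ with a short calculation. Assuming that the 2D U-Net count of 1,082,211 from Table \ref{bratsResult} is the baseline without Distill DSM (an assumption made only for this illustration), the extra parameters $\Delta(\alpha)$ introduced by the module are
\begin{align*}
    \Delta\!\left(\tfrac{1}{4}\right) &= 1{,}115{,}606 - 1{,}082{,}211 = 33{,}395, \\
    \Delta\!\left(\tfrac{1}{2}\right) &= 1{,}216{,}266 - 1{,}082{,}211 = 134{,}055, \\
    \Delta(1) &= 1{,}619{,}330 - 1{,}082{,}211 = 537{,}119,
\end{align*}
so that $\Delta(\frac{1}{2}) / \Delta(\frac{1}{4}) \approx 4.01$ and $\Delta(1) / \Delta(\frac{1}{2}) \approx 4.01$. Doubling $\alpha$ thus roughly quadruples the overhead, which would be consistent with convolution layers whose input and output channel counts both scale with $\alpha C$.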


\begin{table*}[t]
\centering
\caption{Quantitative segmentation results of 2D U-Net, 3D U-Net, Residual DSM and Distill DSM on heart, hippocampus, prostate and pancreas segmentation dataset from medical segmentation decathlon dataset}
	\resizebox{0.95\textwidth}{!}{\begin{tabular}{ccccccc}
\hline
\textbf{Dataset}                      & \textbf{Metric} & \textbf{2D U-Net}     & \textbf{Residual DSM} & \textbf{VFN} & \textbf{3D U-Net}     & \textbf{Distill DSM(Ours)}          \\ \hline
\multirow{2}{*}{Heart} & Dice   & 0.9025$\pm$0.004 & 0.9076$\pm$0.009 & 0.9085$\pm$0.010 & 0.918$\pm$0.009  & 0.9235$\pm$0.011  \\  
                             & HD     & 3.4723$\pm$0.118 & 1.8953$\pm$0.147 & 1.4588$\pm$0.153 & 1.2523$\pm$0.145 & 1.0056$\pm$0.123  \\ \hline
\multirow{3}{*}{Hippocampus} & Dice 1 & 0.8802$\pm$0.002 & 0.8901$\pm$0.007 & 0.8919$\pm$0.005 & 0.8993$\pm$0.004 & 0.8955$\pm$0.005  \\ 
                             & Dice 2 & 0.8618$\pm$0.011 & 0.8648$\pm$0.010 & 0.8663$\pm$0.009 & 0.8847$\pm$0.008 & 0.8786$\pm$0.008  \\  
                             & HD     & 1.6825$\pm$0.082 & 1.4209$\pm$0.045 & 1.4826$\pm$0.089 & 1.2587$\pm$0.075 & 1.3325$\pm$0.13   \\ \hline
\multirow{3}{*}{Prostate}    & Dice 1 & 0.7847$\pm$0.041 & 0.7948$\pm$0.033 & 0.8068$\pm$0.052 & 0.8164$\pm$0.041 & 0.8724$\pm$0.014  \\  
                             & Dice 2 & 0.6978$\pm$0.085 & 0.70214$\pm$0.07 & 0.7425$\pm$0.076 & 0.7339$\pm$0.066 & 0.7804$\pm$0.081  \\  
                             & HD     & 8.0664$\pm$0.567 & 6.8528$\pm$0.532 & 5.6835$\pm$0.692 & 5.5961$\pm$0.217 & 4.6294$\pm$0.485  \\ \hline
\multirow{3}{*}{Pancreas}    & Dice 1 & 0.7395$\pm$0.024 & 0.7624$\pm$0.02  & 0.7650$\pm$0.018 & 0.7739$\pm$0.016 & 0.792$\pm$0.022   \\  
                             & Dice 2 & 0.3485$\pm$0.036 & 0.3632$\pm$0.032 & 0.3684$\pm$0.031 & 0.4115$\pm$0.030  & 0.3765$\pm$0.035  \\  
                             & HD     & 16.7597$\pm$3.88 & 14.4535$\pm$3.82 & 13.6824$\pm$3.26 & 12.5574$\pm$1.04 & 11.9875$\pm$3.436 \\ \hline
\end{tabular}}
\label{MSDResult}
\end{table*}
\subsection{Comparison of results}
In the comparison experiments, we compare the proposed Distill DSM with a 2D segmentation method, U-Net \cite{ronneberger2015unet}, a 3D segmentation method, 3D U-Net \cite{3DUNET}, and the computationally efficient methods residual DSM \cite{tsm} and VFN \cite{xia2018bridging}. Table \ref{bratsResult} summarises the results on the BRATS 2020 dataset \cite{bakas2019identifying}. It also lists the number of parameters, FLOPs per voxel and wall time per voxel, i.e. inference time per voxel. With around 28\% of the parameters of the 3D U-Net architecture, our method achieves comparable results, and it outperforms both residual DSM and 2D U-Net on Dice for all three tumour classes. Figure \ref{qual} visualises the segmentation maps produced by the different methods. \par 
Table \ref{MSDResult} summarises the results on the datasets of the Medical Decathlon challenge \cite{simpson2019large}. Our method outperforms 3D U-Net on the heart and prostate datasets, where only a limited number of 3D volumes is available for training. We attribute this to 3D CNNs being prone to overfitting when training data is limited, which makes our method more effective in such scenarios.
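\par
The relative costs discussed here follow directly from the entries of Table \ref{bratsResult}; the short calculation below only restates those entries as ratios:
\begin{align*}
    \text{parameters, Distill DSM vs.\ 3D U-Net:} \quad & \frac{1{,}216{,}266}{4{,}288{,}208} \approx 0.284, \\
    \text{parameters, Distill DSM vs.\ 2D U-Net:} \quad & \frac{1{,}216{,}266}{1{,}082{,}211} \approx 1.124, \\
    \text{FLOPs per voxel, Distill DSM vs.\ 3D U-Net:} \quad & \frac{39{,}456}{58{,}709} \approx 0.672, \\
    \text{wall time per voxel, Distill DSM vs.\ 3D U-Net:} \quad & \frac{8.2641 \times 10^{-7}}{8.6517 \times 10^{-7}} \approx 0.955.
\end{align*}
That is, Distill DSM uses about 28\% of the parameters and 67\% of the FLOPs per voxel of 3D U-Net, while adding about 12\% to the parameter count of 2D U-Net.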

\section{Conclusion}
Our work focused on a computationally efficient semantic segmentation module for volumetric data. We proposed a novel module, the Distill Depth Shift Module (Distill DSM), for efficiently using information along the depth dimension with only a small increase in parameters compared to a 2D CNN. The proposed module can be inserted into any 2D segmentation architecture to make use of depth information. We achieved either better or comparable results to a 3D CNN with only 28\% of the parameters. The proposed method was tested on BRATS 2020 and 4 datasets from the Medical Decathlon challenge, which validates the proposed method.

% Acknowledgments---Will not appear in anonymized version
% \midlacknowledgments{We thank a bunch of people.}

\bibliography{Maheshwari21}


\end{document}