\section{Introduction}

Multiple sclerosis (MS) is a neuroinflammatory demyelinating disease that occurs in the central nervous system.
Magnetic resonance imaging (MRI) is the most commonly used non-invasive technique for detecting MS lesions.
The total lesion volume, or \emph{lesion load} \cite{mckinley2020automatic}, visible on MRI is often regarded as an important clinical marker of disease severity.
The lesion load is usually obtained with automated lesion segmentation methods \cite{zhang2019multiple,zhang2020efficient,aslani2019multi} or by trained clinical experts.
However, changes in lesion load have been shown to correlate only weakly with measures of disease severity \cite{dworkin2018automated}.
In contrast, several studies \cite{khoury1994longitudinal,rudick2006significance} have shown that changes in lesion count are correlated with changes in the Expanded Disability Status Scale (EDSS), which suggests that the lesion count can be a better surrogate for disease severity.
In addition, the segmentation masks of individual lesions are important in their own right, since the shape and appearance of a lesion can be used to classify rim+ versus rim- lesions \cite{barquero2020rimnet}, and the outcome can further assist diagnosis.
In this paper, we borrow from the computer vision community and characterize the two problems, counting and segmenting individual lesions, as a unified task called lesion instance segmentation.


\begin{figure}[!t]
	\centering
    \includegraphics[width=0.95\columnwidth,height=0.2255\columnwidth]{figs/img_example.png}
	\caption{ 
	Visualization of image examples. 
	(A) A T2-FLAIR image example of a region with confluent lesions. 
	(B) Individual lesions labeled by trained experts.
	(C) Offset value map along sagittal direction.
	(D) Offset value map along coronal direction.
	(E) Weight value map. (For value maps, red represents positive values, and blue represents negative values.)
	}
	\label{fig:img_example}
\vspace{-2ex}
\end{figure}


Lesion instance segmentation can be especially difficult when the patient has many confluent lesions.
Confluent lesions usually occur when individual lesions arising from different sources of structural damage in the brain grow and merge with one another.
Depending on the lesion burden of the patient, the number of individual lesions in a single confluent lesion may range from two to dozens, spanning a large region of the brain.
Because of these confluent lesions, it may take an experienced expert several hours to mask out the individual lesions of a patient with a heavy lesion burden.
Moreover, manual segmentation usually suffers from inter- and intra-observer variability, resulting in inconsistent annotations.
Automated methods can help resolve this issue, but a clinically reliable one is not yet available.

In this paper, we introduce a learning-based technique that addresses the above issue, to obtain a valid and reliable estimate of both the lesion count and the segmentation masks of confluent lesions from a large-scale cross-sectional MR imaging study of MS.
Specifically, we propose a fully automated method, \emph{Memory U-Net}, that combines the advantages of U-Net \cite{ronneberger2015u} and memory networks \cite{sukhbaatar2015end}: the former serves as the feature extractor and the latter as the code book for generalized Hough voting \cite{ballard1981generalizing}.
The main contributions of the paper can be summarized as follows:
\begin{itemize}
    \item This is the first time that a deep convolutional neural network architecture is developed for simultaneously counting and segmenting confluent MS lesions, and the performance of the proposed method surpasses all baseline methods;
    \item This is also the first time that a memory network is embedded in a U-Net, where the stored prototypical feature vectors become a powerful replacement for the code book in generalized Hough voting;
    \item We develop a novel memory updating mechanism and variance-reducing loss function to alleviate the long-tailed problem in offset estimation, and evaluation on a large-scale cross-sectional MR imaging study with 150 MS patients shows the effectiveness of our method.
\end{itemize}


\section{Related Works}

Memory networks have shown promising results in memorizing invariance \cite{zhong2019invariance}, minority cases \cite{he2020learning} and anomalies \cite{gong2019memorizing} in data. In our work, we take advantage of a memory network to memorize prototypical features for offset voting, which is used to localize and distinguish individual lesions.
\vspace{1ex} \\
\textbf{MS Lesion Segmentation.} Great progress has been made in MS lesion segmentation.
While traditional non-learning based methods such as Lesion-TOADS \cite{shiee2010topology} and LST \cite{schmidt2012automated} have shown reasonably good performance, recent advances of deep CNNs have demonstrated more promising results.
% Multi-branch residual network \cite{aslani2019multi} is proposed to improve the generalization ability of the model using multi-modal MR images.
RSA-Net \cite{zhang2019rsanet} and FA-net \cite{zhang2020efficient} apply self-attention mechanisms to improve the aggregation of contextual information in the network.
GEO Loss \cite{zhang2020geometric} uses a boundary-aware loss function to regularize network training, and NeRD \cite{zhang2021nerd} transforms features based on image coordinates.
The 2.5-D Tiramisu network \cite{zhang2019multiple} segments one slice based on its three adjacent slices, which trades off efficiency against the model's ability to extract contextual information.
A statistical technique \cite{dworkin2018automated} has been proposed to count individual lesions based on the lesion probability map and local gradient properties.
In our work, we count and segment individual lesions starting from an initial binary mask.
\vspace{1ex} \\
\textbf{Memory Networks.} Neural networks augmented with memory modules have attracted increasing attention in completing various tasks.
\cite{sukhbaatar2015end} uses an external memory module to improve the performance of language modeling, and this is the first memory network that can be trained in an end-to-end manner.
\cite{gong2019memorizing} applies an external memory module to augment the auto-encoder structure for anomaly detection.
\cite{zhong2019invariance} regularizes network training using a memory-augmented branch and specifically designed loss functions for cross-domain person re-identification.
\cite{he2020learning} demonstrates that an external memory module can be used to force the network to pay more attention to minority cases during training for 3D point cloud segmentation. 
In our work, the memory module stores key-value pairs, where the key is a prototypical feature vector and the value is the offset along a given direction.
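As a minimal sketch of how such a key-value memory could be read (an illustration under assumed notation, not necessarily the exact formulation adopted in this paper), let $\mathbf{f}$ denote the feature vector extracted at a voxel and let $\{(\mathbf{k}_i, v_i)\}_{i=1}^{N}$ denote the $N$ stored key-value pairs. A soft attention read in the style of \cite{sukhbaatar2015end} would then be
\[
w_i = \frac{\exp\left(\mathbf{f}^{\top}\mathbf{k}_i\right)}{\sum_{j=1}^{N}\exp\left(\mathbf{f}^{\top}\mathbf{k}_j\right)},
\qquad
\hat{o} = \sum_{i=1}^{N} w_i \, v_i ,
\]
where $\hat{o}$ is the offset predicted along one direction. The symbols $\mathbf{f}$, $\mathbf{k}_i$, $v_i$, $w_i$ and $N$ are hypothetical notation introduced only for this illustration; the inner-product similarity is likewise an assumption, and other similarity measures could be substituted.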
\vspace{1ex} \\
\textbf{Hough Voting.} The original Hough transform \cite{hough1962method} is designed to detect straight lines in images. 
The generalized Hough transform \cite{ballard1981generalizing} was later proposed to extend the original one to the detection of arbitrary shapes.
Hough voting has also been used for object detection and segmentation.
\cite{leibe2006interleaving} proposes a probabilistic Hough voting model for object segmentation, and \cite{maji2009object} further optimizes the parameters of the probabilistic model using a max-margin constraint.
Additionally, Hough-CNN \cite{milletari2017hough} reports improved segmentation performance on MRI and ultrasound images.
State-of-the-art performance for object detection in 3D point clouds has been achieved by integrating Hough voting \cite{qi2019deep,qi2020imvotenet} into deep neural networks.
In our work, a voting mechanism is adopted jointly with the memory network for better lesion localization.
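To illustrate offset-based voting in general terms (a sketch under assumed notation, not the specific procedure of this paper), suppose that each lesion voxel at position $\mathbf{p}$ in the binary mask $\Omega$ predicts an offset $\hat{\mathbf{o}}(\mathbf{p})$ pointing toward the center of the lesion it belongs to. The votes could then be accumulated as
\[
H(\mathbf{c}) = \sum_{\mathbf{p} \in \Omega} \delta\bigl(\mathbf{c},\, \mathrm{round}\left(\mathbf{p} + \hat{\mathbf{o}}(\mathbf{p})\right)\bigr),
\]
where $\delta(\mathbf{a},\mathbf{b})$ equals $1$ if $\mathbf{a}=\mathbf{b}$ and $0$ otherwise. Under this assumption, local maxima of $H$ act as candidate lesion centers, so their number gives a lesion count, and assigning each voxel to the maximum nearest to its vote gives the instance masks. The symbols $\Omega$, $\mathbf{p}$, $\hat{\mathbf{o}}$, $\mathbf{c}$ and $H$ are hypothetical names used only for this illustration.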
