\section{Experimental Results}

We use PyTorch \cite{paszke2019pytorch} for all of our implementations.
We compare our proposed model with five carefully designed baseline methods.
All models in the experiments are trained or run on a machine with two RTX 2080 Ti GPUs.

\subsection{Implementation Details}

We split the dataset into a training set of 100 subjects, a validation set of 15 subjects, and a testing set of 35 subjects.
For all learning-based methods, each model is trained and tested with three different random seeds, and the reported result is the average over the three runs.
We use a fixed cropping size $(96\times96\times64)$ for random image cropping, and use random intensity shifting for data augmentation. 
We adopt Adam \cite{kingma2014adam} as the optimizer for all learning-based methods, with an initial learning rate of $10^{-3}$ and a multi-step scheduler with steps at $50\%$, $70\%$, and $90\%$ of the total number of epochs.
A batch size of four is used for training, and the training lasts 40 epochs.
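As a concrete reading of this schedule, the three milestones of a 40-epoch run fall at epochs $0.5\times40=20$, $0.7\times40=28$, and $0.9\times40=36$. Writing $\eta_0 = 10^{-3}$ for the initial learning rate and $\gamma$ for the decay factor applied at each milestone (the value of $\gamma$ is not stated here; $0.1$ is the default of PyTorch's \texttt{MultiStepLR} and is only an assumption), the learning rate at epoch $e$ would be
\[
\eta_e = \eta_0\,\gamma^{k(e)}, \qquad k(e) = \bigl|\{\, m \in \{20, 28, 36\} : e \ge m \,\}\bigr| .
\]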
The model that achieves the minimum loss on the validation set is used for testing.
Non-learning methods require no training and are evaluated directly on the testing set.

Additionally, for learning-based methods, once we obtain the voting tensor, which contains the offsets along three directions as well as the voting weight, we generate a density map and apply non-maximum suppression (NMS) to pick the peak points (NMS can be implemented efficiently using max-pooling operations).
We then use the peak points as the initial center points for K-means clustering, and the final segmentation can be obtained when the clustering converges.
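One possible formalization of this post-processing step is sketched below. The symbols are illustrative notation introduced only for this sketch, and the threshold $\tau$, the neighborhood $\mathcal{N}$, and the choice of clustering on voxel coordinates are assumptions rather than reported settings. Let $\mathbf{o}_q$ and $w_q$ denote the predicted offset and voting weight at voxel $q$. The density map accumulates the weighted votes,
\[
D(p) = \sum_{q} w_q \,\mathbf{1}\bigl[\,\mathrm{round}(q + \mathbf{o}_q) = p\,\bigr],
\]
and NMS keeps the local maxima,
\[
\mathcal{P} = \bigl\{\, p : D(p) = \max_{p' \in \mathcal{N}(p)} D(p'),\; D(p) > \tau \,\bigr\},
\]
where the inner maximum over $\mathcal{N}(p)$ is what a max-pooling layer with stride one computes. K-means is then initialized with the $|\mathcal{P}|$ points in $\mathcal{P}$ as centers $\mu_c$, each voxel is assigned the label $\ell(q) = \arg\min_{c} \|q - \mu_c\|_2$, and the centers are updated until convergence.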
We use the Symmetric Best Dice coefficient (SBD) and the absolute Difference in Counting (DiC) as our evaluation metrics.
Specifically, the SBD averages the Dice score between pairs of predicted and ground-truth lesions that yield the maximum Dice score.
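For reference, these metrics are commonly defined as follows, and the standard form is assumed here. Let $\{P_i\}_{i=1}^{M}$ be the predicted lesion instances and $\{G_j\}_{j=1}^{N}$ the ground-truth instances of one subject. The best Dice of one set against the other is
\[
\mathrm{BD}(P, G) = \frac{1}{M} \sum_{i=1}^{M} \max_{1 \le j \le N} \frac{2\,|P_i \cap G_j|}{|P_i| + |G_j|},
\]
and the two metrics are
\[
\mathrm{SBD} = \min\bigl\{\mathrm{BD}(P, G),\, \mathrm{BD}(G, P)\bigr\}, \qquad \mathrm{DiC} = |M - N| .
\]
Taking the minimum over both directions penalizes over-segmentation and under-segmentation alike.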


\subsection{Baseline Methods}

To the best of our knowledge, we are the first to tackle the problem of simultaneously counting and segmenting individual lesions from confluent lesions in MS studies; thus, there is no prior method for direct comparison.
To show the effectiveness of our proposed method, we have carefully designed the following five baseline methods.
\vspace{1ex} \\
\textbf{T2-FLAIR NMS and K-Means}. Since MS lesions appear hyper-intense on T2-FLAIR images, we apply NMS to the T2-FLAIR image to select points with locally maximal intensity as potential lesion centers.
Given these lesion centers, we then use K-Means to obtain the final segmentation.
\vspace{1ex} \\
\textbf{X-Means}. X-Means \cite{pelleg2000x} is an extension of the K-Means clustering method that automatically determines the number of clusters using information-theoretic techniques.
\vspace{1ex} \\
\textbf{LST NMS and K-Means}. LST \cite{schmidt2012automated} is a commonly used lesion segmentation tool.
As in the T2-FLAIR baseline, we apply NMS to the LST output probability map to select points with locally maximal probability as potential lesion centers.
\vspace{1ex} \\
\textbf{U-Net}. Due to the strong abstraction ability of CNN models, the U-Net alone can be used as an implicit code book to generate offset votes.
\vspace{1ex} \\
\textbf{Memory U-Net--}. Since \equationref{eq:memory_update} and \equationref{eq:var} are used to alleviate the long-tail issue in offset estimation, we intentionally remove these two equations from the Memory U-Net-- implementation to see how this affects the results.

\subsection{Effectiveness of the Methods}

The performance comparison among the five baseline methods and our proposed Memory U-Net is shown in \tableref{tab:results}.
We can see from \tableref{tab:results} that all learning-based methods outperform the non-learning, intensity-based methods (except LST) by a large margin in both the DiC and SBD metrics.
Specifically, X-Means performs the worst, as it has no prior information about where the lesion centers may be.
% Without the prior information, information theoretic technique for automatically determining the number of clusters fails completely.
T2-FLAIR NMS and K-Means achieves a reasonable result, which shows the effectiveness of prior information and the importance of the T2-FLAIR MRI sequence for detecting MS lesions.
Though LST achieves an SBD score similar to that of our proposed method, it falls behind in lesion counting.
For learning-based methods, the proposed Memory U-Net outperforms the U-Net alternative in both evaluation metrics. 
Specifically, compared with U-Net, our proposed method reduces the error in counting the number of individual lesions by around $8\%$ and improves the SBD score by around $2\%$.
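If these percentages are read as relative changes, they follow from the mean values in \tableref{tab:results} as
\[
\frac{2.933 - 2.700}{2.933} \approx 7.9\%, \qquad \frac{0.915 - 0.901}{0.901} \approx 1.6\% .
\]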

Because the DiC metric measures an absolute difference, it hides the sign of the counting error; we point out here that both non-learning methods estimate more lesions than the ground truth, whereas all learning-based methods estimate fewer lesions than the ground truth.
Interestingly, Memory U-Net--, which omits \equationref{eq:memory_update} and \equationref{eq:var}, achieves a lower DiC but a worse SBD, which suggests that Memory U-Net-- produces more incorrect peak votes than Memory U-Net.
In conclusion, the memory-based U-Nets surpass the non-memory U-Net, and the proposed Memory U-Net further improves the performance of lesion instance segmentation by addressing the long-tail issue.

\begin{table}[!t]
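\caption{Quantitative comparison among the five baseline methods and the proposed Memory U-Net. Learning-based methods are reported as average (standard deviation) over three random seeds.}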
\vspace{1ex}
\begin{center}
    \resizebox{0.65\columnwidth}{!}
    {
        \begin{small}
        %\begin{sc}
        \begin{tabular}{l|cc}
        \hline
        \hline
        %\toprule
                    Methods                              & DiC $\downarrow$    & SBD $\uparrow$  \\
        %\midrule
        \hline
                    T2-FLAIR NMS and K-Means             &3.567   &0.853 \\
                    X-Means                              &54.533  &0.460 \\
                    LST NMS and K-Means                  &4.233   &0.914 \\
                    U-Net                                &2.933 (0.047)   &0.901 (0.004) \\      
                    Memory U-Net--                       &\textbf{2.417} (0.024)   &0.895 (0.007) \\
                    Memory U-Net                         &2.700 (0.046)   &\textbf{0.915} (0.001) \\
        %\bottomrule
        \hline
        \hline
        \end{tabular}
        %\end{sc}
        \end{small}
        \label{tab:results}
    }
\end{center}
\vspace{-4ex}
\end{table}