\section{Experiments and results}

We provide experiments to evaluate segmentation of \textit{unprocessed} scans, eliminating the dependence on additional tools that can be CPU-intensive and require manual tuning.


\subsection{Datasets}

We use four datasets with an array of modalities, and contrast variations within modalities. All datasets contain labels for 37 regions of interest (ROIs), with the same labeling protocol.

\paragraph{T1-39:} 39 whole head T1 scans with manual segmentations \cite{fischl_freesurfer_2012}. We split the dataset into subsets of 20 and 19 scans. We use the label maps of the first 20 as the only inputs to train \netname{}, and evaluate on the held-out 19. We augmented the manual labels with approximate segmentations for skull, eye fluid, and other extra-cerebral tissue, computed semi-automatically with in-house tools, to enable synthesis of full head scans.

\paragraph{T1mix:}  1,000 T1 whole head MRI scans collected from seven public datasets: ABIDE \cite{di_martino_autism_2014}, ADHD200 \cite{the_adhd-200_consortium_adhd-200_2012}, GSP \cite{holmes_brain_2015}, HABS \cite{dagley_harvard_2017}, MCIC \cite{gollub_mcic_2013}, OASIS \cite{marcus_open_2007}, and PPMI \cite{marek_parkinson_2011}. Although these scans share the same modality, they exhibit variability in intensity distributions and head positioning due to differences in acquisition platforms and sequences. Since manual delineations are not available for these scans, we evaluate against automated segmentations obtained with FreeSurfer~\cite{fischl_freesurfer_2012,dalca_anatomical_2018}. T1mix enables evaluation on a large dataset of heterogeneous T1 contrasts.

\paragraph{FSM:} 18 subjects, each with 3 modalities: T1, T2, and a sequence typically used in deep brain stimulation (DBS). The DBS scan is an MP-RAGE with: TR $= \SI{3000}{\milli\second}$, TI $= \SI{406}{\milli\second}$, TE $=\SI{3.56}{\milli\second}$, $\alpha = 8^\circ$. With no manual delineations available, for evaluation we use automated segmentations produced by FreeSurfer on the T1 channel as ground truth for all modalities. This dataset enables evaluation on two new contrasts, T2 and DBS.

\paragraph{T1-PD-8:} T1 and proton density (PD) scans for 8 subjects, with manual delineations. These scans were approximately skull-stripped before being made available. Despite its smaller size, this dataset enables evaluation on another contrast (PD) that is very different from T1. \newline

\noindent Although FreeSurfer segmentations are not as accurate as manual delineations, they enable evaluation where manual labels are missing. FreeSurfer has been thoroughly evaluated on numerous independent datasets~\cite{fischl_whole_2002, tae_validation_2008}. It also yields high Dice scores against manual segmentations for T1-39 (0.88, albeit biased by mixing FreeSurfer training and testing data) and T1-PD-8 (0.85).


\subsection{Competing methods}

We compare our method \netname{} with three other approaches:

\paragraph{Fully supervised network:} We train a \textit{supervised} U-Net on the 20 training images from the T1-39 dataset (whole brain, unprocessed), aiming to assess the difference in performance when testing on images of the same contrast (T1) acquired on the same and other platforms. We employ the same architecture and loss function as for \netname{}, and we use the same data augmentation when applicable, specifically spatial deformation, gamma augmentation, and normalization of intensities. This supervised network can only segment T1 scans, so we refer to it as the ``T1 baseline''.

\paragraph{SAMSEG:} Based on the traditional Bayesian segmentation framework, SAMSEG~\cite{puonti_fast_2016} uses unsupervised likelihood distributions, and is thus fully contrast-adaptive. Like our method, SAMSEG can segment both unprocessed and skull-stripped scans. SAMSEG does not rely on neural networks, and thus does not require training, but instead employs an independent optimization for each scan, which takes tens of minutes.

\paragraph{\netname{}-rule:} We also analyze a variant of our proposed method, where the intensity parameters are representative of the test scans to be segmented. For each of the seven contrasts present in the training data (T2, PD, DBS, and four varieties of T1), we build a Gaussian hyperprior for the means and standard deviations of each label, using ground truth segmentations. At training, for every mini-batch we sample one of the seven contrasts, then we sample the means and standard deviations for each class conditioned on the contrast. This variant enables us to compare the generation of unrealistic contrasts during training against enforcing prior information on the target modalities, if available. An example of these more realistic synthetic images (conditioned on T1 contrast) is shown in \figureref{fig:augm_realistic}.
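
As a minimal sketch of this sampling scheme, using notation introduced here purely for illustration, let $c$ index the contrast and $k$ the label, and let $(m^{\mu}_{c,k}, s^{\mu}_{c,k})$ and $(m^{\sigma}_{c,k}, s^{\sigma}_{c,k})$ denote the hyperprior parameters estimated for the means and standard deviations, respectively. Assuming that the contrast is drawn uniformly (an assumption, since the sampling distribution over contrasts is not specified above), each mini-batch would draw
\[
c \sim \mathcal{U}\{1,\dots,7\}, \qquad
\mu_k \mid c \sim \mathcal{N}\big(m^{\mu}_{c,k},\, (s^{\mu}_{c,k})^2\big), \qquad
\sigma_k \mid c \sim \mathcal{N}\big(m^{\sigma}_{c,k},\, (s^{\sigma}_{c,k})^2\big).
\]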

\begin{figure}[t]
\floatconts
  {fig:augm_realistic}
  {\caption{Generation of a T1-like image for training \netname{}-rule.}}
  {\centering\includegraphics[width=0.90\textwidth]{figures/generative_model_3.pdf}}
\end{figure}


\subsection{Experimental setup}

All CNN methods are trained on the training subset of T1-39, with our method variants only requiring the segmentation maps, whereas the supervised baseline also uses the T1 scans. We evaluate all approaches on the test subset of T1-39, as well as all of T1mix, T1-PD-8, and FSM. The T1 baseline is not tested on modalities other than T1, nor on T1-PD-8 because it cannot cope with skull-stripped data. We assess performance with Dice scores, computed for a representative subset of 12 brain ROIs: cerebral white matter (WM) and cortex (CT), lateral ventricle (LV), cerebellar white matter (CW) and cortex (CC), thalamus (TH), caudate (CA), putamen (PU), pallidum (PA), brainstem (BS), hippocampus (HP), and amygdala (AM). We average results for contralateral structures.
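
For reference, the Dice score of an ROI can be written as follows, where $A$ and $B$ (symbols introduced here for illustration) denote the sets of voxels assigned to that ROI by the automated and the reference segmentation, respectively:
\[
\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}.
\]
As a hypothetical example, $|A| = 100$, $|B| = 120$, and $|A \cap B| = 90$ give $2 \cdot 90 / 220 \approx 0.82$.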


\subsection{Results}

\begin{table}[tbp]
\setlength\tabcolsep{3pt} 
\floatconts
  {tab:summary}
  {\caption{Summary of results, capturing the performance of each method, its ability to segment arbitrary modalities, and run time (averaged over 10 runs). SAMSEG was run using 8 cores (Intel Xeon at 3.00GHz), whereas \netname{} was run on an Nvidia P6000 GPU. Image loading time was not considered.}}
  {\small \begin{tabular}{c || c |  c |  c}
  \hline
  Method & Overall performance & Modality-agnostic & Runtime (s)  \\
  \hline \hline
 Supervised & 0.89 $\pm$ 0.10 (same dataset); 0.59 $\pm$ 0.11 (other T1s) & No & 3.06 $\pm$ 0.02\\  
 SAMSEG & 0.83 $\pm$ 0.02  & Yes & 1382 $\pm$ 192\\  
 \netname{}-rule & 0.82 $\pm$ 0.02 & Yes & 3.22 $\pm$ 0.03\\  
 \netname{} & 0.85 $\pm$ 0.02 & Yes & 3.22 $\pm$ 0.03\\  
  \hline
  \end{tabular}}
\end{table}

\tableref{tab:summary} provides a summary of the methods and their runtime. \figureref{fig:dice} shows box plots for each ROI, method, and dataset, as well as averages across the ROIs. \tableref{tab:pvalues} shows the corresponding median scores and p values from Wilcoxon signed-rank tests. Finally, \figureref{fig:examples} shows sample segmentations for every method and dataset.

The supervised T1 baseline excels on the test scans of T1-39 (i.e., intra-dataset), achieving a mean Dice of 0.89 and outperforming all the other methods for every ROI. However, when tested on T1 images from T1mix and FSM, we observe substantial variations in Dice scores (e.g., across the different sub-datasets within T1mix), with a consistent decrease in performance (see for instance the segmentation of the T1 in FSM in \figureref{fig:examples}). This is likely due to the limited variability in the training dataset, despite the use of augmentation techniques, and it highlights the challenge of variation in \textit{unprocessed} scans from different sources, even within the same modality.

\begin{figure}[t]
\floatconts
  {fig:dice}
  {\caption{Dice scores obtained by each method, shown both in aggregate (top left), and for individual ROIs on each dataset. Sub-datasets of T1mix are marked with a star.}}
  {\includegraphics[width=\textwidth]{figures/boxplots.pdf}}
\end{figure}


\begin{table}[tbp]
\setlength\tabcolsep{3pt} 
\floatconts
  {tab:pvalues}
  {\caption{Median Dice scores and p values for two-sided non-parametric Wilcoxon signed-rank tests comparing \netname{} and the competing methods. Sub-datasets of T1mix are marked with a star.}}
  {\small \begin{tabular}{|c|c|c|c|c|c|c|c|}
  \hline
  & \netname{}  & \multicolumn{2}{c|}{T1 baseline}  &  \multicolumn{2}{c|}{SAMSEG} &   \multicolumn{2}{c|}{\netname{}-rule}  \\ \cline{2-8}
  Dataset  & Med. Dice  &  Med. Dice & p value  &   Med. Dice &  p value  &  Med. Dice  &  p value   \\ 
  \hline
  T1-39   & 0.861 & 0.894 & $p<10^{-3}$ &   {0.849} & $p<10^{-3}$  &  {0.819} & $p<10^{-3}$    \\ 
  T1mix   & 0.852 &  {0.601} & $p<10^{-94}$  & 0.858 & $p<10^{-30}$ & {0.806} & $p<10^{-85}$  \\ 
  ABIDE*  & 0.838 & {0.761} & $p<10^{-15}$  & 0.856 & $p<10^{-11}$   & {0.799} & $p<10^{-19}$  \\ 
  ADHD*   & 0.843  & {0.649} & $p<10^{-14}$  & 0.857 & $p<10^{-4}$   & {0.804} & $p<10^{-9}$  \\ 
  HABS*   & 0.858 & {0.630} & $p<10^{-4}$  & 0.859 & $0.7$  &  {0.819} & $p<10^{-4}$  \\ 
  GSP*    & 0.853  & {0.549} & $p<10^{-92}$  & 0.857 & $p<10^{-7}$  &  {0.806} & $p<10^{-90}$  \\ 
  MCIC*   & 0.869  & {0.750} & $p<10^{-5}$  & {0.863} & $ 6.4 \times 10^{-3}$  &  {0.829} & $p<10^{-5}$  \\ 
  OASIS*  & 0.855  & 0.857 & $p<10^{-4}$  & 0.867 & $p<10^{-10}$  &  {0.812} & $p<10^{-12}$  \\ 
  PPMI*   & 0.851  & {0.726} & $p<10^{-12}$  & 0.861 & $p<10^{-6}$  &  {0.811} & $p<10^{-12}$  \\
  T1 FSM  & 0.869  & {0.531} & $p<10^{-13}$  & 0.869 & $0.6$  &  {0.827} & $p<10^{-3}$  \\
  T2 FSM  & 0.841   & N/A   & N/A & {0.822} & $ 1.2 \times 10^{-3}$  &  {0.822} & $ 1.5 \times 10^{-2}$  \\  
  DBS FSM & 0.828  & N/A & N/A  & {0.821} & $ 3.8 \times 10^{-3}$  &  0.831 & $ 0.2$  \\                 
  T1-PD8  & 0.848  & N/A & N/A  & {0.823} & $ 1.7 \times 10^{-2}$  &  {0.810} & $ 1.2 \times 10^{-2}$  \\  
  PD-PD8  & 0.830  & N/A  & N/A  & {0.801} & $ 3.6 \times 10^{-2}$  &  0.830 & $ 0.4$  \\   
  \hline
  \end{tabular}}
\end{table}





\begin{figure}[t]
\floatconts
  {fig:examples}
  {\caption{Example segmentations for each method and dataset. We selected the median subject in terms of Dice scores across ROIs and methods.}}
  {\includegraphics[width=\textwidth]{figures/mosaic.pdf}}
\end{figure}

SAMSEG yields very uniform results across datasets of T1 contrasts, producing mean Dice scores within 3 points of each other. Being agnostic to contrast, it outperforms the T1 baseline outside its training domain. It also performs well for the non-T1 contrasts. Although the mean Dice scores are slightly lower than for the T1 datasets (which normally display better contrast between gray and white matter), they remain robust for every contrast and dataset, with a minimum mean Dice of 0.81.

\netname{} also produces high Dice scores across all contrasts, slightly higher than SAMSEG (0.02 mean Dice improvement), while requiring a fraction of its runtime (\tableref{tab:summary}). The difference between SAMSEG and \netname{} is smaller for T1mix and T1 FSM because SAMSEG is positively biased by the use of FreeSurfer segmentations as ground truth, since SAMSEG and FreeSurfer work similarly. The improvement of \netname{} over SAMSEG is consistent across structures, except for the cerebellum. Compared to the T1 baseline, \netname{} yields a mean Dice 0.03 lower on the supervised training domain (T1-39), but it generalizes significantly better to other T1 datasets and can segment other MRI contrasts with little decrease in performance (minimum mean Dice of 0.83).
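
As a rough worked figure from \tableref{tab:summary}, the ratio of mean runtimes is
\[
\frac{1382}{3.22} \approx 429,
\]
i.e., a reduction of more than two orders of magnitude, with the caveat that the two methods were timed on different hardware (CPU for SAMSEG, GPU for \netname{}).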

Importantly, \netname{}-rule is outperformed by \netname{}, and its Dice scores are also slightly lower than those produced by SAMSEG. This illustrates that adapting the parameters to a certain contrast is counterproductive, at least within our simple generative model: we observe consistent drops in performance across ROIs and datasets, despite injecting contrast-specific knowledge for each modality. This result is consistent with recent findings in image augmentation~\cite{chaitanya_semi-supervised_2019}, and supports the theory that forcing the network to learn to segment a broader range of images than it will typically observe at test time improves generalization.