\section{Methods}

We first introduce the generative model for Bayesian MRI segmentation, and then describe our method, which builds on this framework to achieve modality-agnostic segmentation.

\subsection{Classical generative model for Bayesian segmentation of brain MRI}
\label{bayesian}

The Bayesian segmentation framework relies on a probabilistic generative model for brain  scans. Let~$L$ be a 3D label (segmentation) map consisting of~$J$ voxels, such that each voxel value $L_j$ is one of~$K$ possible labels: $L_j \in \{1,\ldots,K\}$. The generative model starts with a prior anatomical distribution $p(L)$, typically represented as a (precomputed) statistical atlas~$A$, which associates each voxel location with a $K$-length vector of label probabilities. Additionally, the atlas $A$ is endowed with a spatial deformation model: the label probabilities are warped with a field $\phi$, parameterized by $\theta_{\phi}$, which follows a distribution~$p(\theta_{\phi})$ chosen to encourage smooth deformations. The probability of observing $L$ is then:
%
\begin{align}
    p(L| A, \theta_\phi) = \prod_{j=1}^{J} [A \circ \phi(\theta_{\phi})]_{j,L_j},
\end{align}
%
where $[A \circ \phi(\theta_{\phi})]_{j,L_j}$ is the probability of label $L_j$ given by the warped atlas at location $j$. 

Given a label map $L$, the image likelihood~$p(I|L)$ for its corresponding image $I$ is commonly modeled as a Gaussian mixture model (GMM) conditioned on $L$, modulated by smooth, multiplicative bias field noise (additive in the more convenient logarithmic domain). Specifically, each label $k \in \lbrace 1,\ldots,K\rbrace$ is associated with a Gaussian intensity distribution with mean $\mu_{k}$ and standard deviation $\sigma_{k}$. We group these Gaussian parameters into $\theta_G = \{\mu_1,\sigma_1,\ldots,\mu_K,\sigma_K\}$. The bias field is often modeled as a linear combination of smooth basis functions, whose coefficients are grouped in $\theta_{B}$ \cite{larsen_n3_2014}. The image likelihood is given by:
%
\begin{align}
    p(I | L, \theta_B, \theta_G) = \prod_{j=1}^{J} \mathcal{N}(I_j - B_j(\theta_B) ; \mu_{L_j} , \sigma_{L_j}^2),
    \label{eq:likelihood}
\end{align}
%
where~$\mathcal{N}(\cdot; \mu, \sigma^2)$ is the Gaussian distribution, $I_j$ is the image intensity at voxel $j$, and $B_j(\theta_B)$ is the bias field at voxel $j$. We assume that $I_j$ has been log-transformed, such that the bias field is additive, rather than multiplicative.

Bayesian segmentation uses Bayes's rule to ``invert'' this generative model to estimate~$p(L|I)$, posing segmentation as an optimization problem. Such inversion often relies on computing point estimates for the model parameters. Fitting the Gaussian parameters $\theta_G$ to the intensity distribution of the test scan is what makes these methods contrast-agnostic.
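
As a worked illustration of this inversion, suppose that point estimates $\hat{\theta}_\phi$, $\hat{\theta}_B$, and $\hat{\theta}_G=\{\hat{\mu}_k,\hat{\sigma}_k\}$ are available (the hat notation is introduced here only for this sketch, and how the estimates are obtained is left aside). Because both the prior and the likelihood in~\eqref{eq:likelihood} factorize over voxels, the posterior factorizes as well, and Bayes's rule gives, for each voxel $j$ and label $k$:
%
\begin{align*}
    p(L_j = k \,|\, I, A, \hat{\theta}_\phi, \hat{\theta}_B, \hat{\theta}_G) =
    \frac{[A \circ \phi(\hat{\theta}_{\phi})]_{j,k} \; \mathcal{N}(I_j - B_j(\hat{\theta}_B) ; \hat{\mu}_{k} , \hat{\sigma}_{k}^2)}
         {\sum_{k'=1}^{K} [A \circ \phi(\hat{\theta}_{\phi})]_{j,k'} \; \mathcal{N}(I_j - B_j(\hat{\theta}_B) ; \hat{\mu}_{k'} , \hat{\sigma}_{k'}^2)}.
\end{align*}
%
Under these assumptions, the most likely segmentation is obtained by taking, at each voxel, the label $k$ that maximizes this expression.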

\begin{figure}[t]
\centering
\floatconts
  {fig:schematic}
  {%\vspace{-0.5cm}
  \caption{\netname{} overview. The proposed data generation process selects one of the available label maps~$S_m$ and employs a sampling strategy to synthesize an image-segmentation pair~$\{I,L\}$, based on a well-established generative model of brain MRI. Specific generation steps are illustrated in Figure~\ref{fig:augm_example} and detailed in Algorithm~\ref{alg:net}. The pairs~$\{I,L\}$ are used to train a CNN in a supervised fashion.}}
  {\includegraphics[width=1\textwidth]{figures/schematic.pdf} 
  }
\end{figure}
%
\begin{algorithm2e}[t]
\caption{Proposed Learning Strategy for \netname{}}
\label{alg:net}
\DontPrintSemicolon
\KwIn{$\{S_m\}_{m=1,\ldots,M}$   \tcp*{M segmentations}}
\While{not converged}{
$i \sim \mathcal{U}_d(1,M)$  \tcp*{select input map}
$\theta_{aff}  \sim \mathcal{U}(a_{rot},b_{rot})  \times   \mathcal{U}(a_{sc},b_{sc})  \times  \mathcal{U}(a_{sh},b_{sh}) \times \mathcal{U}(a_{tr},b_{tr}) $  \tcp*{affine parameters}
$\theta_v \sim \mathcal{N}_{10\times10\times10\times3}(0,\sigma_{svf}^2)$ \tcp*{sample SVF parameters}
$\phi_v(\theta_v) \leftarrow ScaleAndSquare[Upscale(\theta_v)]$ \tcp*{upscaling and integration}
$\phi \leftarrow \phi_{aff}(\theta_{aff}) \circ \phi_v(\theta_v)$  \tcp*{form deformation}
$L \leftarrow S_i \circ \phi$  \tcp*{deform selected label map}
$(\mu_k,\sigma_k) \sim  \mathcal{U}(a_{\mu},b_{\mu}) \times \mathcal{U}(a_{\sigma},b_{\sigma}), k=1,\ldots,K$  \tcp*{sample Gaussian parameters}
$G_j \sim \mathcal{N}(\mu_{L_j},\sigma_{L_j}^2)$  \tcp*{sample GMM image}
$G^{blur} \leftarrow G * R(\sigma_{blur})$ \tcp*{spatial blurring}
$\theta_B \sim \mathcal{N}_{4\times4\times4}(0,\sigma_{b}^2)$ \tcp*{sample bias field parameters}
$B  \leftarrow \exp[Upscale(\theta_B)]$ \tcp*{upscaling and exponentiation}
$G^{bias} \leftarrow G^{blur} \odot B$ \tcp*{bias field corruption}
$\gamma \sim   \mathcal{U}(a_\gamma,b_\gamma)$ \tcp*{gamma augmentation parameter}
$I \leftarrow f(G^{bias},\gamma)$  \tcp*{gamma and normalization via~\eqref{eq:intensity_augm}}
update CNN weights with pair $\{I, L\}$  \tcp*{SGD iteration}
}
\end{algorithm2e}


\subsection{Proposed approach}
\label{sec:approach}

We propose to train a segmentation CNN using synthetic data created on the fly with a generative model similar to that of Bayesian segmentation. Since sampling label maps from a probabilistic atlas under the voxel independence assumption would yield extremely heterogeneous, noisy images, we instead rely on a set of $M$ original label maps~$S=\{S_m\}_{m=1}^M$. We also slightly blur the sampled intensities. The proposed learning strategy, detailed below, is summarized in \figureref{fig:schematic} and Algorithm~\ref{alg:net}, and exemplified in \figureref{fig:augm_example}.

\paragraph{Data sampling:} In training, mini-batches are created  by sampling image-segmentation pairs~$\{I,L\}$ as follows. First, we randomly select a label map $S_i$ from the training dataset (\figureref{fig:augm_example}a), by sampling $i \sim \mathcal{U}_d(1,M)$, where $\mathcal{U}_d$ is the discrete uniform distribution. 

\begin{figure}[t]
\centering
\floatconts
  {fig:augm_example}
  {%\vspace{-0.3cm}
  \caption{Intermediate steps of image generation (axial slices of 3D volumes). (a)~Segmentation. (b)~Warp with random smooth deformation field. (c)~Image intensities sampled via a GMM with random parameters. (d)~Blur. (e)~Random  bias field. (f)~Synthesized images with the contours of the corresponding label maps.}}
  {\includegraphics[width=\textwidth]{figures/generative_model_1-2.pdf}}
\end{figure}

Next, we generate a random deformation field~$\phi$ to obtain a new anatomical map~\mbox{$L = S_i \circ \phi$}. The deformation field $\phi$ is the composition of an affine and a deformable random transform, $\phi_{aff}$ and $\phi_{v}$, parameterized by $\theta_{aff}$ and $\theta_v$, respectively:~$\theta_\phi=(\theta_{aff},\theta_v)$. The affine component is the composition of three rotations ($\theta_{rot}$), three scalings ($\theta_{sc}$), three shears ($\theta_{sh}$), and three translations ($\theta_{tr}$). All these parameters are independently sampled from 
continuous uniform distributions with predefined ranges: $\mathcal{U}(a_{rot},b_{rot})$, 
$\mathcal{U}(a_{sc},b_{sc})$,
$\mathcal{U}(a_{sh},b_{sh})$, 
and $\mathcal{U}(a_{tr},b_{tr})$, respectively.
The deformable component is a diffeomorphic transform, obtained by integrating a smooth, random stationary velocity field (SVF) with a scaling and squaring approach~\cite{moler_nineteen_2003,arsigny_log-euclidean_2006}, implemented efficiently for a GPU~\cite{dalca_unsupervised_2019-1,krebs_learning_2019}. The SVF is generated by first sampling the parameters $\theta_v$. This is a random, low-resolution tensor (size $c_v \times c_v \times c_v \times3$), where each element is a sample from a zero-mean Gaussian distribution with standard deviation $\sigma_{svf}$. This tensor is subsequently upscaled to the desired image resolution with trilinear interpolation, to obtain a smooth SVF, which is  integrated to obtain $\phi_v$. The final deformed label map is obtained by resampling
%
\begin{align}
L = S_i \circ \phi =  S_i \circ [\phi_{aff}(\theta_{aff})\circ \phi_v(\theta_v)]
\end{align}
%
with nearest neighbor interpolation. This generative model yields a wide distribution of neuroanatomical shapes, while ensuring spatial smoothness (\figureref{fig:augm_example}b).
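
For reference, the scaling and squaring integration can be sketched as follows. We write $v$ for the upscaled SVF and $T$ for the number of squaring steps; both symbols are introduced here for illustration, and the value of $T$ is not specified above:
%
\begin{align*}
    \phi^{(0)} &= \mathrm{Id} + 2^{-T} v, \\
    \phi^{(t+1)} &= \phi^{(t)} \circ \phi^{(t)}, \qquad t = 0,\ldots,T-1, \\
    \phi_v &= \phi^{(T)} \approx \exp(v).
\end{align*}
%
Each squaring doubles the integration time, so that $T$ compositions integrate the velocity field over unit time, whereas a naive Euler scheme with the same step size would require $2^T$ steps.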

Given the segmentation $L$, we sample a synthetic image $I$ as follows. First, we sample an image $G$ conditioned on $L$, following the likelihood model introduced in Section~\ref{bayesian}, one voxel at a time using~$G_j \sim \mathcal{N}(\mu_{L_j}, \sigma_{L_j}^2)$. The Gaussian parameters~$\{\mu_k, \sigma_k\}$ are a set of $K$ independent means and standard deviations drawn from continuous uniform distributions $\mathcal{U}(a_\mu,b_\mu)$ and $\mathcal{U}(a_\sigma,b_\sigma)$, respectively. Sampling independently from a wide range of values yields images of extremely diverse contrasts (\figureref{fig:augm_example}c). To mimic partial volume effects, we make the synthetic images more realistic by introducing a small degree of spatial correlation between neighboring voxels. This is achieved by blurring $G$ with a Gaussian kernel $R(\sigma_{blur})$ with standard deviation $\sigma_{blur}$ voxels, i.e., $G^{blur} = G * R(\sigma_{blur})$ (\figureref{fig:augm_example}d).
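
To give a sense of scale, assume that $R(\sigma_{blur})$ is an isotropic Gaussian sampled at voxel centers, so that its unnormalized weight at an offset $x$ (in voxels) is $\exp(-\|x\|^2 / (2\sigma_{blur}^2))$; this discretization is an assumption made for illustration only. With $\sigma_{blur}=0.3$ (\tableref{tab:hyperparameters}), the weight of a face-adjacent neighbor ($\|x\|=1$) relative to the central voxel is
%
\begin{align*}
    \exp\left(-\frac{1}{2 \times 0.3^2}\right) \approx \exp(-5.56) \approx 0.004,
\end{align*}
%
so each voxel is mixed only weakly with its immediate neighbors, consistent with the small degree of spatial correlation sought here.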

We then corrupt the images with a bias field $B$, parameterized by $\theta_B$. $B$ is generated similarly to the SVF: $\theta_B$ is a random, low-resolution tensor (size $c_B\times c_B \times c_B$), whose elements are independent samples from a zero-mean Gaussian distribution $\mathcal{N}(0,\sigma_b^2)$. This tensor is upscaled to the image size of $L$ with trilinear interpolation, and the voxel-wise exponential is taken to ensure positivity. The bias-field-corrupted image $G^{bias}$ is obtained by voxel-wise multiplication: $G^{bias} = G^{blur} \odot B$ (\figureref{fig:augm_example}e).
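
The link with the additive bias model of Section~\ref{bayesian} can be made explicit. Writing $b$ for the upscaled coefficient map (a symbol introduced here for illustration), so that $B_j = \exp(b_j)$, we have at every voxel where $G^{blur}_j > 0$:
%
\begin{align*}
    \log G^{bias}_j = \log G^{blur}_j + b_j .
\end{align*}
%
As a rough order of magnitude, with $\sigma_b = 0.5$ (\tableref{tab:hyperparameters}), a value $b_j = \pm\sigma_b$ scales the local intensity by $\exp(\pm 0.5)$, i.e., by approximately $1.65$ or $0.61$.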

Finally, the training image $I$ is generated by standard gamma augmentation and normalization of intensities. We first sample $\gamma$  from a uniform distribution $\mathcal{U}(a_\gamma,b_\gamma)$ and then:
%
\begin{equation}
\label{eq:intensity_augm}
I_j = \left( [G^{bias}_j - \min_{x}(G^{bias}_x)] \bigg/ [\max_{x}(G^{bias}_x)-\min_{x}(G^{bias}_x)] \right)^{\exp(\gamma)}.
\end{equation}

\paragraph{Training:} Starting from a set of label maps, we use the generative process described above to form training pairs~$\{I, L\}$ (\figureref{fig:augm_example}f). These pairs -- each sampled with different parameters -- are used to train the CNN in a standard supervised fashion~(\figureref{fig:schematic}). 


\begin{table}[tbp]
\setlength\tabcolsep{3pt} 
\floatconts
  {tab:hyperparameters}
  {\caption{Hyperparameters used in our experiments. Angles are in degrees;  spatial measures are in voxels. Intensity hyperparameters assume an input in the [0,255] interval.}  
  %\vspace{-10pt}
  }
  {\small \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
  \hline
 $a_{rot}$ & $b_{rot}$ & $a_{sc}$ & $b_{sc}$ & $a_{sh}$ & $b_{sh}$ & $a_{tr}$ & $b_{tr}$ & $\sigma_{svf}$ & $a_\mu$ & $b_\mu$ & $a_\sigma$ & $b_\sigma$ &$\sigma_{blur}$ & $\sigma_b$ & $a_\gamma$ & $b_\gamma$ & $c_v$ & $c_B$ \\
  \hline
  -10 & 10 & 0.9 & 1.1 & -0.01 & 0.01 & -20 & 20 & 3 & 25 & 225 & 5 & 25 & 0.3 & 0.5 & -0.3 & 0.3 & 10 & 4\\
  \hline
  \end{tabular}}
\end{table}


\subsection{Implementation details}
\label{implementation details}

\paragraph{Architecture:} We use a U-Net style architecture~\cite{ronneberger_u-net_2015,cicek_3d_2016} with 5 levels of 2 layers each. The first layer contains 24 feature maps, and this number is doubled after each max-pooling, and halved after each upsampling. Convolutions are performed with kernels of size $3\times3\times3$, and use the Exponential Linear Unit as activation function~\cite{clevert_fast_2016}. We also make use of batch normalization layers before each max-pooling and upsampling layer~\cite{ioffe_batch_2015}. The last layer uses a softmax activation function. The loss function is based on the average soft Dice coefficient~\cite{milletari_v-net_2016} between the ground truth segmentation and the probability map corresponding to the predicted output.
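
For concreteness, one common form of the soft Dice coefficient, following~\cite{milletari_v-net_2016}, is shown below as an illustration rather than as the exact expression used in training. The symbols $y_{j,k}$ (one-hot ground truth at voxel $j$ for label $k$), $\hat{y}_{j,k}$ (softmax output), and $\mathcal{L}$ (loss) are introduced here, and taking the loss as one minus the average coefficient is an assumption:
%
\begin{align*}
    \mathrm{Dice}_k = \frac{2 \sum_{j=1}^{J} y_{j,k}\, \hat{y}_{j,k}}{\sum_{j=1}^{J} y_{j,k}^2 + \sum_{j=1}^{J} \hat{y}_{j,k}^2},
    \qquad
    \mathcal{L} = 1 - \frac{1}{K} \sum_{k=1}^{K} \mathrm{Dice}_k .
\end{align*}
%
Under this form, minimizing $\mathcal{L}$ maximizes the average overlap across the $K$ labels, and every label contributes equally regardless of its volume.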

\paragraph{Parametric distributions and intensity constraints:}
The proposed generative model involves several hyperparameters (described above), which control the priors of model parameters. 
In order to achieve invariance to input contrast, we set the hyperparameters of the GMM priors (the ranges of intensity means and standard deviations) to wide intervals and sample the parameters of each Gaussian independently, which generally leads to unrealistic images (\figureref{fig:augm_example}). The deformation hyperparameters are chosen to yield a wide range of shapes -- well beyond plausible anatomy. We emphasize that the hyperparameter values, summarized in \tableref{tab:hyperparameters}, are \textit{not} chosen to mimic a particular imaging modality or subject cohort.

\paragraph{Skull stripping:} 
The proposed method is designed to segment brain MRI without any preprocessing. However, in practice, some brain MRI datasets do not include extracerebral tissue, for example due to privacy issues. We build robustness to skull-stripped images into our method, by treating all extracerebral regions as background in 20\% of training samples. 

\paragraph{GPU implementation:} Our model, including the image sampling process, is~\mbox{implemented} on the GPU in Keras \cite{chollet_keras_2015} with a TensorFlow backend \cite{abadi_tensorflow_2016}.