\synctex=1

\usepackage[utf8]{inputenc}
\usepackage{xr-hyper}
\usepackage{times}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{multirow}
\usepackage{acronym}


\usepackage{tikz}
\usepackage{pgfplots}
\usepackage{booktabs}
\usepackage{bbm}
\usepackage{siunitx}
\usepackage{environ}


\usepackage{caption} \captionsetup[table]{skip=1pt}
\usepackage{floatrow}
\usepackage{booktabs}
\newfloatcommand{capbtabbox}{table}[][\FBwidth]
\usepackage{blindtext}
\def\NoNumber#1{{\def\alglinenumber##1{}\State#1}\addtocounter{ALG@line}{-1}}
\setlength{\parskip}{1.5mm plus0mm minus1mm}


\usepackage{amsmath}
\newcommand{\ttt}{\boldsymbol{\theta}}
\newcommand{\lbold}{\boldsymbol{\lambda}}
\newcommand{\s}{\boldsymbol{s}}
\newcommand{\ff}{\boldsymbol{f}}
\newcommand{\real}{\mathbb{R}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\mc}{\mathcal}
\newcommand{\mb}{\mathbb}
\providecommand{\etal}{\textit{et al.}}

\usepackage{soul}

\usetikzlibrary{intersections}
\usepgfplotslibrary{fillbetween}

\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\DICE}{DICE}

\newtheorem{Defn}{Definition}
\newtheorem{Prop}{Proposition}


\usepackage{comment}

\usepackage{xparse}
\usepackage{csvsimple}
\usepackage{pgf}
\usepackage{ifdraft}
\usepackage{catchfile}

\pgfplotsset{compat=1.15}

\newcommand{\mhl}[1]{\textcolor{red}{\textbf{\hl{#1}}}}


\newacro{uda}[UDA]{unsupervised domain adaptation}
\newacro{da}[DA]{domain adaptation}
\newacro{mmd}[MMD]{maximum mean discrepancy}
\newacro{mri}[MRI]{magnetic resonance imaging}
\newacro{gm}[GM]{gray matter}
\newacro{wm}[WM]{white matter}
\newacro{csf}[CSF]{cerebrospinal fluid}
\newacro{rbf}[RBF]{radial basis function}
\newacro{kl}[KL]{Kullback-Leibler}
\newacro{pp}[pp]{percentage points}

\begin{document}

\input{data_macros}

\title{On Direct Distribution Matching for Adapting Segmentation Networks}

\maketitle


\begin{abstract}

\input{abstract}


 \end{abstract}

\begin{keywords}
  domain adaptation, unsupervised domain adaptation, semantic segmentation, direct distribution matching
\end{keywords}

\section{Introduction}




Semantic segmentation is pivotal to the high-level understanding of image content, which is useful in a breadth of application areas, from autonomous driving to health care.
Particularly, in medical imaging, segmentation facilitates clinical tasks, including disease diagnosis, treatment and follow-up, among others. Modern medical segmentation approaches rely on deep learning techniques, which 
have demonstrated outstanding performance in a breadth of applications \cite{dolz20183d,dolz2018hyperdense,litjens2017survey}. Despite their success, generalization of trained models to new scenarios is hampered
if the gap between data distributions across domains is large. A trivial solution to address this issue would be to re-annotate images from different domains and re-train or fine-tune the deep models. Nevertheless, 
obtaining such massive amounts of labeled data is a cumbersome process which, for some applications, may require user expertise, resulting in a prohibitive and unrealistic approach. 
 
To tackle this problem, \ac{uda} techniques have been widely investigated. These methods aim at learning robust classifiers in the
presence of a \textit{shift} between source and target distributions when the target data is unlabeled.
In this scenario, the goal is typically to minimize the discrepancy between distributions across domains at the input \cite{bousmalis2017unsupervised,chen2018semantic,hoffman2017cycada,russo2018source,sankaranarayanan2017generate,wu2018dcan} or intermediate-feature level \cite{ganin2015unsupervised,ghifary2016deep,kamnitsas2017unsupervised,liu2018detach,long2015learning,long2016unsupervised,tzeng2017adversarial}, while leveraging labeled source examples to retain discriminative power on the feature space.
Generative techniques operate either at the pixel level \cite{bousmalis2017unsupervised,chen2018semantic,russo2018source,shrivastava2017learning,zhang2018task} or in feature space \cite{dou2018unsupervised,ganin2015unsupervised,kamnitsas2017unsupervised,long2015learning,tzeng2017adversarial} and align the image appearance between domains, so that the target data ``style'' is transferred to source data, or vice-versa.
Then, supervised learning is performed with the newly generated synthetic data. A downside of these approaches is that they perform satisfactorily only 
for small images and narrow domain shifts, which limits their applicability. Within the current paradigm of learning domain-invariant representations, domain adversarial training \cite{ganin2015unsupervised,tzeng2017adversarial} and \ac{mmd} \cite{long2015learning,sun2016deep,yan2017mind} have become very popular choices. 

For semantic segmentation problems, adversarial training models \cite{goodfellow2014generative} are currently dominating the literature \cite{chen2017no,chen2018road,dou2018pnp,kamnitsas2017unsupervised,hoffman2016fcns,hong2018conditional,saito2018maximum,tsai2018learning,vu2018advent}. 
Such models alternate the training of two networks: a discriminator that learns a decision boundary between source and target features and a segmentation network 
that uses the learned decision boundary to match the feature distributions across domains.
Some other approaches rely on generative networks, which yield target images conditioned on the source, or vice-versa, aligning both domains at the pixel level \cite{cai2019towards,huo2018synseg,murez2018image,sankaranarayanan2017generate,zhang2018translating,zhao2019supervised}. 

While adversarial training achieved outstanding performances in image classification, our numerical evidence and intuition suggest that it may not be suitable for segmentation tasks to the same degree.
First, learning a discriminator boundary for a segmentation task is much more complex as the label space is exponentially large. Intuitively, 
a high dimensional label space implies that the discriminator boundary can be very complicated and thus hard to learn. Therefore, as we 
will see later in our experiments, alternating both adversarial and prediction tasks in segmentation might cause more significant training instabilities than in image classification tasks.
Moreover, the source and target domains are less likely to share the same
multi-level feature representations if the label space is high dimensional.

While the inputs can differ significantly from one domain to another, the output (label) space in semantic segmentation conveys very rich information 
related to the spatial layout and local context, which is shared across domains. Inspired by this observation, Tsai \etal \cite{tsai2018learning} proposed 
adversarial training in the output (softmax segmentation) space, achieving better performance than feature-matching approaches on the Cityscapes dataset. Leveraging this information is even more
meaningful in medical images, where label (output) statistics remain domain-independent, despite significant differences in image inputs across domains. 
Nevertheless, following the trend in \ac{uda} approaches for natural image segmentation, adversarial learning has become the \textit{de facto} choice in medical image segmentation \cite{chen2019synergistic,dou2018pnp,gholami2018novel,javanmardi2018domain,kamnitsas2017unsupervised,zhang2018task,zhao2019supervised}.
It is worth mentioning that some recent natural image segmentation works \cite{zhang2017curriculum,zou2018unsupervised} pointed out that adversarial models for classification
do not translate well to segmentation. These studies showed that similar or better performances can be achieved by other alternatives.

Here, we propose a simple, easily trainable approach to \ac{uda} that can be applied in cases where the underlying (latent) ground truth is identical for source and target domains, up to some geometric transformations of unlabeled images. While unrealistic for natural images, this can easily be achieved in medical imaging, e.g., by obtaining separate scans of one patient with different imaging methods or by applying multi-modal registration algorithms to unlabeled image pairs.
The class of loss functions we propose encourages direct density matching in the network's output space. It follows the idea of minimizing distribution-matching losses, a principled approach to \ac{da} in the context of image classification, e.g., \ac{mmd} \cite{long2015learning,sun2016deep,yan2017mind}.
Rather than using an intermediate domain discriminator, our direct approach unifies distribution matching and segmentation in a single loss. Therefore, 
it simplifies segmentation adaptation by avoiding extra adversarial steps, while improving quality, stability and efficiency of training. We compare our approach to the state-of-the-art segmentation method in \cite{tsai2018learning}. In the challenging task of adapting brain segmentation across different \ac{mri} modalities, our approach achieves significantly better performance than adversarial output adaptation, both in terms of accuracy and stability. We also investigate experimentally the sensitivity of our approach to the alignment of unlabeled image pairs.



\section{Formulation}
\label{sec:method}

Consider an unsupervised domain-adaptation setting with two distinct subsets: $\mathcal L = \{(X_i, Y_i)\}_{i = 1, \dots, n}$ contains labeled source-domain images $X_i$ and 
the corresponding ground-truth segmentations $Y_i$, and $\mathcal U = \{(X_{i}, X'_{i})\}_{i = n+1, \dots, n+m}$ contains {\em unlabeled} image pairs, each involving a source 
image $X_i$ and a target image $X'_i$. For each labeled source image $X_i: \Omega \subset {\mathbb Z}^{2,3} \rightarrow \mathbb{R}$, $ i = 1, \dots, n$, the ground-truth labeling
$Y_i \in \{0, 1\}^{L \times |\Omega|}$ is a matrix whose columns are binary vectors, encoding the assignment of pixel $p \in \Omega$ to one of $L$ classes (segmentation regions): 
${\mathbf y}_i(p) = (y_i(1,p), \dots, y_i (L,p)) \in \{0,1\}^L$, where $y_i(l,p) = 1$ if and only if label $l$ is assigned to pixel $p$ of the $i$-th image. 
For any image $X$,
let ${\mathbf s}_{\mathbf{\theta}}(p, X) = \big(s_{\mathbf{\theta}}(1, p, X), \dots, s_{\mathbf{\theta}} (L, p, X)\big) \in [0,1]^L$ denote the probability vector of
softmax outputs for pixel $p$, with $\mathbf{\theta}$ the trainable parameters of the network. For the sake of simplicity, we will omit the subscript $\mathbf{\theta}$ in the following. 
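
As a reminder of what ${\mathbf s}(p, X)$ contains, and assuming (hypothetical notation, not used elsewhere in this paper) that the last layer of the network produces a real-valued score $z(l, p, X)$ for each class $l$ at pixel $p$, the standard softmax reads
\begin{equation*}
  s(l, p, X) = \frac{\exp\big(z(l, p, X)\big)}{\sum_{k=1}^{L} \exp\big(z(k, p, X)\big)}, \qquad \text{so that} \qquad \sum_{l=1}^{L} s(l, p, X) = 1 .
\end{equation*}
Each ${\mathbf s}(p, X)$ is thus a valid probability distribution over the $L$ classes, which is what makes it meaningful to compare two such vectors with a divergence between distributions.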

We propose to minimize the following loss function:
\begin{align}
\label{density-matching-loss}
  {\cal F}(\theta) &= \sum_{i=1}^n \sum_{p \in \Omega} {\cal H}\big({\mathbf y}_i (p), {\mathbf s} (p, X_i)\big)
               + \lambda \sum_{i=n+1}^{n+m} \sum_{p \in \Omega} {\cal D} \big({\mathbf s} (p, X_i), {\mathbf s} (p, {\cal T}(X'_i))\big) ,
\end{align}
where
\begin{itemize}
\item $\mathcal D(\mathbf s,\mathbf s')$ evaluates the discrepancy between two probability 
  distributions $\mathbf s$ and $\mathbf s'$, e.g., \ac{kl} divergence $\mathcal D_{\mathrm{KL}}(\mathbf s, \mathbf s') = {\mathbf s}^T \ln \frac{{\mathbf s}}{{\mathbf s}'}$, where superscript ${{}\cdot{}}^T$ denotes transposition.
          
\item ${\cal H}$ denotes standard cross-entropy loss for labeled source-domain images: ${\cal H}(\mathbf y, \mathbf s) = \mathcal D_{\mathrm{KL}}(\mathbf y, \mathbf s)$.
\item $\lambda$ is a non-negative multiplier.
\item ${\cal T}$ could be simply identity if unlabeled images $X_i$ and $X'_i$ are aligned, e.g., by acquisition\footnote{In some practical scenarios, images from different modalities are aligned when acquired at the same time.}. Also, ${\cal T}$ could be a geometric transformation, which aligns pairs of unlabeled images, for instance, using a standard automatic cross-modality registration algorithm \cite{Oliveira-Tavares-14}. 
\end{itemize}
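
\noindent To make the two terms concrete, they can be expanded componentwise. This is only a sketch, with the convention $0 \ln 0 = 0$ and with $l^\ast_i(p)$ (notation introduced here for illustration) denoting the label assigned to pixel $p$ in $Y_i$. Since ${\mathbf y}_i(p)$ is one-hot,
\begin{align*}
  {\cal H}\big({\mathbf y}_i(p), {\mathbf s}(p, X_i)\big)
    &= \sum_{l=1}^{L} y_i(l,p) \ln \frac{y_i(l,p)}{s(l,p,X_i)}
     = - \ln s\big(l^\ast_i(p), p, X_i\big), \\
  \mathcal D_{\mathrm{KL}}\big({\mathbf s}(p, X_i), {\mathbf s}(p, {\cal T}(X'_i))\big)
    &= \sum_{l=1}^{L} s(l,p,X_i) \ln \frac{s(l,p,X_i)}{s(l,p,{\cal T}(X'_i))} .
\end{align*}
As a purely illustrative numerical example with $L = 2$, the made-up outputs ${\mathbf s} = (0.9, 0.1)$ and ${\mathbf s}' = (0.6, 0.4)$ give $\mathcal D_{\mathrm{KL}}({\mathbf s}, {\mathbf s}') = 0.9 \ln 1.5 + 0.1 \ln 0.25 \approx 0.226$, whereas identical outputs give zero. With the \ac{kl} divergence, the second term of \eqref{density-matching-loss} therefore vanishes exactly when source and target predictions agree at every pixel of every unlabeled pair.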

\noindent The first term in our model \eqref{density-matching-loss} is the usual cross-entropy loss of a semantic segmentation problem on the source domain, while the second term, which is based on unlabeled image pairs, encourages the network outputs (softmax segmentations) in the target domain to closely match those in the source domain.
In fact, when $\cal D$ corresponds to some kernel function, i.e., ${\cal D}(\cdot,\cdot) = - {\cal K}(\cdot,\cdot)$, the summation over pixels in the second term of \eqref{density-matching-loss} can be expressed in terms of a kernel $\tilde{{\cal K}}$ between two softmax segmentations in $[0,1]^{L \times |\Omega|}$:
\begin{equation}
\label{kernel-segmentation}
-\tilde{{\cal K}} \big(S(X_i), S({\cal T}(X'_i))\big) = \sum_{p \in \Omega} {\cal D} \big({\mathbf s}(p, X_i), {\mathbf s}(p, {\cal T}(X'_i))\big)
\end{equation}
with $S(X) \in [0,1]^{L \times |\Omega|}$ denoting the matrix whose columns are the softmax outputs at each pixel, i.e., probability vectors ${\mathbf s}(p, X)$.
Now, notice that the {\em kernel density estimate (KDE)}\footnote{KDEs are also commonly referred to as Parzen window estimates.} of the distribution of source-domain softmax
segmentations, i.e., the network outputs in $[0,1]^{L \times |\Omega|}$, can be written as follows:
${\cal P}(S(X)) \propto \sum_{i=n+1}^{n+m} \tilde{{\cal K}} (S(X_i), S(X)), \, \forall X$.
Therefore, by maximizing these source density estimates at target-domain segmentations, we directly
match the distributions of the source and target domains in the network-output space. This amounts to minimizing
the following direct distribution-matching loss:
\begin{equation}
\label{full-pairwise-kernels}
-\sum_{j=n+1}^{n+m}{\cal P}\big(S({\cal T}(X'_j))\big) \propto -\sum_{i,j=n+1}^{n+m} \tilde{{\cal K}}\big(S(X_i),S({\cal T}(X'_j))\big).
\end{equation}
Clearly, from the expression of kernel $\tilde{{\cal K}}$ in \eqref{kernel-segmentation}, the second term in our loss in \eqref{density-matching-loss} can be viewed as an 
approximation of \eqref{full-pairwise-kernels} based on a subset of pairwise matching kernels.
Therefore, our loss in \eqref{density-matching-loss} encourages direct density matching in the network-output space. 
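
To spell out this approximation, here is a short sketch that assumes ${\cal D} = -{\cal K}$ as above and uses the index range $i = n+1, \dots, n+m$ of $\mathcal U$. Substituting \eqref{kernel-segmentation} into the second term of \eqref{density-matching-loss} gives
\begin{equation*}
  \lambda \sum_{i=n+1}^{n+m} \sum_{p \in \Omega} {\cal D} \big({\mathbf s} (p, X_i), {\mathbf s} (p, {\cal T}(X'_i))\big)
  = - \lambda \sum_{i=n+1}^{n+m} \tilde{{\cal K}} \big(S(X_i), S({\cal T}(X'_i))\big),
\end{equation*}
which retains only the $m$ matched terms $i = j$ out of the $m^2$ pairwise kernels summed in \eqref{full-pairwise-kernels}. The remaining $m(m-1)$ cross terms, each of which compares a source segmentation with the target segmentation of a different pair, are dropped.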

Fig.~\ref{fig:conceptualcomparison} highlights the conceptual differences between our direct matching (Fig.~\ref{fig:conceptualcomparison:c}) and the state-of-the-art adversarial
method in \cite{tsai2018learning}, which pursues two-step adversarial learning in the network-output space (Fig.~\ref{fig:conceptualcomparison:a} and~\ref{fig:conceptualcomparison:b}), so as to achieve the same
goal as our loss: matching the source and target distributions of label predictions.
\begin{figure*}[h!]
  \centering
  \subfigure[Adversarial (Discriminator)]{
    \begin{minipage}{.33\linewidth}
      \centering
            \includegraphics[width=.9\textwidth]{Discriminator2.png}
      \vspace{5pt}
      \label{fig:conceptualcomparison:a}
    \end{minipage}
  }%
    \subfigure[Adversarial (Segmenter)]{
    \begin{minipage}{.33\linewidth}
      \centering
      \includegraphics[width=.85\textwidth]{Generator.png}
      \vspace{5pt}
      \label{fig:conceptualcomparison:b}
    \end{minipage}
  }%
    \subfigure[Direct Distribution Matching]{
    \begin{minipage}{.32\linewidth}
      \centering
      \includegraphics[width=.85\textwidth]{Direct_Distribution_Matching2.png}
      \vspace{5pt}
      \label{fig:conceptualcomparison:c}
    \end{minipage}
  }%
  \caption{A conceptual juxtaposition of adversarial training in the network-output space \cite{tsai2018learning} (Fig.~\ref{fig:conceptualcomparison:a} and~\ref{fig:conceptualcomparison:b}) and our direct density matching (Fig.~\ref{fig:conceptualcomparison:c}).
    The data points in the figure depict network outputs
    (softmax segmentations), with the blue points
    corresponding to the source and dashed points to the target. }
\label{fig:conceptualcomparison}
\end{figure*}
The model in \cite{tsai2018learning} alternates the training of two networks: a discriminator,
which learns to distinguish between source and target outputs; and a segmentation network, which is trained using the discriminator. The discriminator is used to encourage the target outputs to be similar to those of the source domain.
Rather than using an intermediate domain discriminator, our direct method unifies distribution matching and segmentation in a single loss. Therefore, it simplifies segmentation adaptation by avoiding extra adversarial steps, while improving the quality, stability and efficiency of training. While adversarial training achieved outstanding performances in image classification, our numerical evidence and intuition suggest that it may not be suitable for segmentation, in which case learning a discriminator boundary
is much more complex as it solves for predictions in an exponentially large label space. In fact, intuitively, a large label space implies large spaces of possible solutions for discriminator boundaries and target predictions, both of which are latent; see dashed boundaries and data points in Fig.~\ref{fig:conceptualcomparison:a}. Alternating both adversarial and prediction tasks in segmentation can cause more significant instabilities than in image classification tasks, as we will see later in our experiments.

Another important difference between our approach and adversarial training is that we account for the fact that target and source data have a common ground truth in the label space, up to some geometric transformation. Such prior information is very common and useful in medical imaging problems, but adversarial approaches do not have mechanisms to take advantage of it.       
 

\section{Experiments}



We evaluated our approach extensively on the challenging task of brain tissue segmentation in \ac{mri} scans, and compared the performances to the state-of-the-art method in \cite{tsai2018learning}.






\subsection{Experimental details}


\hspace{0.5cm}{\bfseries{Datasets:}}
We performed numerical studies on two public segmentation benchmarks: MRBrainS2013 \cite{mendrik2015mrbrains} and iSEG2017 \cite{iSEG}. The MRBrainS dataset contains 5 labeled and 15 unlabeled scans of adult brains. The iSEG dataset is composed of 10 labeled and 13 unlabeled infant brain scans. 
We tested our domain adaptation on the T1 and T2-FLAIR modalities of MRBrainS and the T1 and T2 modalities of iSEG. The task consists of segmenting the \ac{wm}, \ac{gm} and \ac{csf}. The original T2 images from iSEG were resampled into an isotropic $1 \times 1 \times 1 \si{\cubic\milli\meter}$ resolution, and then aligned onto their corresponding T1 images with a simple affine registration method. The sequences from the MRBrainS Challenge were aligned by rigid registration, using Elastix \cite{klein2010elastix}. 


{\bfseries{Training:}}
The data $(\mathcal L, \mathcal U)$ consists of two distinct subsets. $\mathcal L = \{(X_1, Y_1), \dots ,(X_n, Y_n)\}$
is the labeled subset, which contains images $X_i$ from the source domain with their corresponding ground truth $Y_i$. The unlabeled subset $\mathcal U = \{(X_{n+1}, X'_{n+1}), \dots ,(X_{n+m}, X'_{n+m})\}$ contains pairs of aligned source and target images, without ground truth.
In our experiments, we found that the choice of the discrepancy function $\mathcal D$ does not significantly alter performance.
If not mentioned otherwise, we used \ac{kl} divergence $\mathcal D(\mathbf s, \mathbf s') = {\mathbf s}^T \ln \frac{{\mathbf s}}{{\mathbf s}'}$ and the multiplier $\lambda = \deflambda$.


Due to the limited size of the training set, we employed a leave-one-out cross-validation strategy, where only one image was used for testing/evaluation, leaving the remaining images for training. We used four of the five labeled scans in the MRBrainS2013 dataset as samples in $\mathcal{L}$. The one remaining scan was used for evaluation. As the iSEG dataset contains more labeled scans, we opted to use 8 scans for training in $\mathcal{L}$ and one scan each for testing and evaluation. Furthermore, all unlabeled scans, i.e., 15 and 13 in the MRBrainS2013 and iSEG datasets, respectively, are used in $\mathcal{U}$ to compute the unsupervised term in \eqref{density-matching-loss}. Each experiment was performed three times with different evaluation/testing data splits; the averages and empirical standard deviations reported below were computed over these three runs.

{\bfseries{Baselines and comparisons:}}
In order to evaluate the impact of the adaptation approaches, we trained the segmentation network in a supervised manner on the source and target data, providing a lower and upper bound for the \ac{uda} results. While the network trained on source images is referred to as \emph{no adaptation}, the network trained on the target domain is referred to as the \emph{oracle}.
In addition, we compare the proposed approach with the adversarial method proposed in \cite{tsai2018learning}, referred to as AdaptSegNet in the following. For a fair comparison, we used the same segmentation network for the proposed and the adversarial approach. For simplicity, we chose the ``single-level'' strategy, performing \ac{da} only on the output layer. We used the same discriminator model as~\cite{tsai2018learning}. The Lagrange multiplier for training the segmentation network was chosen to be $\lambda_{\mathrm{adv}} = \defadaptlambda$.
Although AdaptSegNet does not utilize the fact that source and target data are aligned in $\mathcal{U}$, we nevertheless trained the discriminator with these aligned pairs. Subsequent runs indeed revealed that this does not have an impact on the performance of AdaptSegNet.

{\bfseries{Implementation details:}}
We used a slightly modified U-Net \cite{Ronneberger2015U} for the segmentation task, operating on 2D slices. Particularly, the employed network follows the original implementation \cite{Ronneberger2015U}, but the depth is reduced by one, i.e., max-pool is performed only three times instead of four.
We used ReLU activation functions and did not include dropout, to avoid any regularization that does not originate from our proposed \ac{da} strategy. To obtain 2D input, the 3D images are sliced along the z-axis. However, Dice coefficients are computed on the 3D scans.
The implementation was done in TensorFlow, and the experiments were run on a server equipped with an NVIDIA Titan V GPU with 12 GB memory. For all networks, we employed the Adam~\cite{Kingma2014Adam} optimizer with learning rate $\mathrm{lr} = 0.0001$ and a batch size of $32$. We performed fully supervised pre-training for 200 epochs on the source domain data. Subsequently, we trained for 800 epochs with the full loss \eqref{density-matching-loss}, totaling 1000 training epochs.
The code is publicly available at \url{https://github.com/g-pichler/DDMSegNet}.
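
The two-stage schedule above can be summarized compactly. As a sketch, and under the assumption that supervised pre-training amounts to minimizing only the first term of \eqref{density-matching-loss}, the effective multiplier $\lambda_t$ (notation introduced here for illustration) at epoch $t$ is
\begin{equation*}
  \lambda_t =
  \begin{cases}
    0, & 1 \le t \le 200, \\
    \lambda, & 200 < t \le 1000 .
  \end{cases}
\end{equation*}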


{\bfseries{Evaluation:}}
We resorted to the common Dice coefficient, widely employed in medical image segmentation, to compare quantitatively the performances of the different methods.
When using the iSEG dataset, the mean Dice coefficient on the test scan was used to determine the best model during training. We then report the performance of this model on the evaluation sample. Due to the limited size of the MRBrainS2013 dataset, here, the testing and evaluation sets are identical.

We report Dice coefficients in percent and when comparing the performance of two models, we refer to the absolute difference in \ac{pp}.
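
For reference, the Dice coefficient is commonly defined as follows. The symbols $A$ and $B$, denoting the sets of voxels assigned to a given class by the prediction and by the ground truth, respectively, are introduced here for illustration only:
\begin{equation*}
  \DICE(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} .
\end{equation*}
As a made-up example, $|A| = 100$, $|B| = 120$ and $|A \cap B| = 90$ yield $\DICE(A, B) = 180/220 \approx 81.8\%$. A second model reaching $85.0\%$ on the same scan would then be $3.2$\,\ac{pp} better.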






\begin{table*}[ht!]
  \footnotesize
  \caption{\ac{da} results on the MRBrainS and iSEG datasets, showing the Dice coefficient over the three classes (i.e., \ac{gm}, \ac{wm} and \ac{csf}) as well as the mean. Coefficients are given in percent.}
  \makebox[\textwidth][c]{
    \begin{tabular}{LLLCCCZ}
      \toprule
      &       &        & \multicolumn{4}{c}{Mean Dice}              \\
      \midrule
      &          &      & Oracle                        & No adaptation                 & AdaptSegNet  & Proposed                      \\ 
      \midrule
      Source   & Target   &      & Target$\longrightarrow$Target & Source$\longrightarrow$Target & Source$\longrightarrow$Target                        & Source$\longrightarrow$Target \\
      \midrule
      MRB (T1) & MRB (T2-FLAIR) &
                                  \printFullRow{%
                                  data-mrbrains13/model-DDM/unsup-False/src-T2_FLAIR/targ-T2_FLAIR/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                  data-mrbrains13/model-DDM/unsup-False/src-T1/targ-T2_FLAIR/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                  data-mrbrains13/model-AdaptSegNet/unsup-True/src-T1/targ-T2_FLAIR/lambda-\defadaptlambda/loss-None/shuf-False/eval-False,
                                  data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\deflambda/loss-\defloss/shuf-False/eval-False}[{&&}][TARG_DC,SRC_DC] \\
      \midrule
      MRB (T2-FLAIR) & MRB (T1) &
                                  \printFullRow{%
                                  data-mrbrains13/model-DDM/unsup-False/src-T1/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                  data-mrbrains13/model-DDM/unsup-False/src-T2_FLAIR/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                  data-mrbrains13/model-AdaptSegNet/unsup-True/src-T2_FLAIR/targ-T1/lambda-\defadaptlambda/loss-None/shuf-False/eval-False,
                                  data-mrbrains13/model-DDM/unsup-True/src-T2_FLAIR/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-False}[{&&}][TARG_DC,SRC_DC]\\
      \midrule
      iSEG (T1) & iSEG (T2) &
                              \printFullRowEval{%
                              data-iseg/model-DDM/unsup-False/src-T2/targ-T2/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                              data-iseg/model-DDM/unsup-False/src-T1/targ-T2/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                              data-iseg/model-AdaptSegNet/unsup-True/src-T1/targ-T2/lambda-\defadaptlambda/loss-None/shuf-False/eval-True,
                              data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\deflambda/loss-\defloss/shuf-False/eval-True}[{&&}][TARG_DC,SRC_DC]\\
      \midrule
      iSEG (T2) & iSEG (T1) &
                              \printFullRowEval{%
                              data-iseg/model-DDM/unsup-False/src-T1/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                              data-iseg/model-DDM/unsup-False/src-T2/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                              data-iseg/model-AdaptSegNet/unsup-True/src-T2/targ-T1/lambda-\defadaptlambda/loss-None/shuf-False/eval-True,
                              data-iseg/model-DDM/unsup-True/src-T2/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-True}[{&&}][TARG_DC,SRC_DC]\\
      \bottomrule           
    \end{tabular}
  }
  \label{tab:results}
\end{table*}

  
\subsection{Results}
\label{sssec:results}

Table~\ref{tab:results} reports the class-specific and mean Dice coefficients in percent.
Compared with the oracle, the performance drops dramatically without adaptation, particularly for \ac{wm}. The adversarial adaptation strategy proposed in \cite{tsai2018learning}, AdaptSegNet, is able to infer target domain information during learning and to recover segmentation performance. For example, when shifting from T1 to T2, AdaptSegNet improves the mean performance by at least
$\pgfmathprintnumber[fixed,precision=1,]{\AdaptSegNetVSNoAdapt}\ac{pp}$ in comparison to \textit{no adaptation}, in both MRBrainS and iSEG images. Despite this improvement, there is still a considerable gap of at least $\pgfmathprintnumber[fixed,precision=1,]{\AdaptSegNetVSOracle}\ac{pp}$ compared to the oracle.
On the other hand, the improvement achieved by our method is more pronounced, getting closer to the performance of the oracle. In particular, in all four settings, the differences between the oracle and our method are in the range between
$\pgfmathprintnumber[fixed,precision=1,]{\DDMvsOracleMIN}\ac{pp}$ and $\pgfmathprintnumber[fixed,precision=1,]{\DDMvsOracleMAX}\ac{pp}$. Furthermore, in most cases, the standard deviation is considerably lower with the proposed approach than with the adversarial method. Another interesting finding when independently analyzing the class-specific results is that the proposed method reliably follows the behavior of the oracle. For each of the four analyzed settings, the ranking of the classes by Dice coefficient is the same for the oracle and the proposed approach.

Qualitative results of these models are depicted in Fig.~\ref{fig:visual}. Specifically, cross-sectional 2D \ac{mri} scans of two given patients are shown, for both source and target domains, along with the corresponding ground truth and segmentation masks obtained by the different models. We can observe that if no adaptation method is applied, the model trained on the source domain completely fails to segment the target image. Including an adversarial adaptation module visually improves the segmentation, which aligns with the numerical values reported in Table~\ref{tab:results}. Taking a closer look at the AdaptSegNet segmentation, we observe that while the \ac{csf} (in brown) seems to correlate with the ground truth, both \ac{wm} and \ac{gm} (in yellow and green, respectively) only capture global information, being imprecise in local details. This may be because the appearance of the \ac{csf} remains similar across domains, whereas the intensity distributions of \ac{wm} and \ac{gm} differ strongly between source and target domains. Indeed, this observation also holds for the \textit{no adaptation} setting, where \ac{csf} segmentation obtains the best performance for \ac{da} on MRBrainS. In contrast, the proposed direct distribution matching method is able to correctly capture differences between images, satisfactorily adapting both domains.


\begin{figure*}[h!]
\centering
\includegraphics[width=\textwidth]{ICCV-2019_2.png}
\caption{Visual results for two MRBrainS subjects achieved by the different models in the case of adapting a T1-trained model to T2-FLAIR images (\textit{top}), and a T2-FLAIR-trained model to T1 images (\textit{bottom}). These images were randomly selected from one of the three runs.}
\label{fig:visual}
\end{figure*}




                


\subsubsection{Sensitivity to image misalignment.}
\label{sec:disalignment}


Our proposed method assumes perfectly aligned images between the source and the target domain in the unlabeled training set $\mathcal U$.
To test the sensitivity of our approach to a violation of this assumption of alignment between $X$ and $X'$, one would ideally pair scans of different individuals in $\mathcal{U}$. As the datasets are small, we instead deliberately shuffled the existing unsupervised pairs using a cyclic shift, and then performed our experiments with the modified unlabeled training data $\mathcal U = \{(X_{n+1}, X'_{n+2}), \dots ,(X_{n+m-1}, X'_{n+m}) ,(X_{n+m}, X'_{n+1})\}$.
However, in order to avoid a misalignment due to the imaging procedure, we performed a 3D affine registration using the \href{http://www.simpleitk.org/}{SimpleITK} software package, registering $X'_{n+i+1}$ to $X_{n+i}$ using mutual information~\cite{mattes2003pet} as the optimization metric.
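
The cyclic shift can be written as an index map; the symbol $\pi$ is introduced here for illustration and is not part of the original notation. For $i = n+1, \dots, n+m$ with $m > 1$, let
\begin{equation*}
  \pi(i) = n + 1 + \big((i - n) \bmod m\big),
\end{equation*}
so that $\pi(n+1) = n+2$, \dots, $\pi(n+m-1) = n+m$ and $\pi(n+m) = n+1$. The shuffled set is then $\{(X_i, X'_{\pi(i)})\}_{i = n+1, \dots, n+m}$, and the unsupervised term of \eqref{density-matching-loss} compares ${\mathbf s}(p, X_i)$ with ${\mathbf s}(p, {\cal T}(X'_{\pi(i)}))$, where ${\cal T}$ plays the role of the affine registration described above. For instance, with $m = 3$ the pairs are $(X_{n+1}, X'_{n+2})$, $(X_{n+2}, X'_{n+3})$ and $(X_{n+3}, X'_{n+1})$.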

The results are detailed in Table~\ref{tab:results:shuf_with_reg}. When adapting from T1 to T2, the proposed approach achieves results similar to those of the adversarial method, even offering a slight increase of $\pgfmathprintnumber[fixed,precision=1,]{\TiAdaptSegNetVSDDMShufISEG}\ac{pp}$ and $\pgfmathprintnumber[fixed,precision=1,]{\TiiAdaptSegNetVSDDMShufISEG}\ac{pp}$ in the iSEG and MRBrainS datasets, respectively.
On the iSEG dataset neither method substantially outperforms the \emph{no adaptation} strategy, both AdaptSegNet and the proposed method being within $\pgfmathprintnumber[fixed,precision=1,]{\NoAdaptVSAdaptSegNetVSDDMShufISEG}\ac{pp}$.
Only when adapting from T2 to T1 on the MRBrainS dataset, while still improving substantially upon \emph{no adaptation}, AdaptSegNet outperforms the proposed approach by $\pgfmathprintnumber[fixed,precision=1,]{\AdaptSegNetVSDDMShufMRB}\ac{pp}$.

AdaptSegNet does not leverage the alignment between images, whereas our proposed approach is built upon the assumption of perfect image alignment. Nevertheless, the data in Table~\ref{tab:results:shuf_with_reg} suggest that the proposed approach might still be useful when the alignment between the domains is not perfect and is instead obtained, e.g., by a pre-registration step.





\begin{table*}[ht!]
  \small
  \caption{Mean Dice coefficient in percent when there is misalignment between the images, but an affine registration is performed prior to training.}
  \makebox[\textwidth][c]{
    \begin{tabular}{LCCCZ}
                  \toprule
       & Oracle & No adaptation & AdaptSegNet  & Proposed   \\
                  \midrule
      MRBrainS, T1 $\to$ T2
                                    \printSingleRow{%
                                    data-mrbrains13/model-DDM/unsup-False/src-T2_FLAIR/targ-T2_FLAIR/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                    data-mrbrains13/model-DDM/unsup-False/src-T1/targ-T2_FLAIR/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                    data-mrbrains13_aligned/model-AdaptSegNet/unsup-True/src-A_T1/targ-B_T2_FLAIR/lambda-\defadaptlambda/loss-None/shuf-False/eval-False,
                                                                        data-mrbrains13_aligned/model-DDM/unsup-True/src-A_T1/targ-B_T2_FLAIR/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                                                        }[TARG_DC,SRC_DC][][]  \\
            MRBrainS, T2 $\to$ T1
                                    \printSingleRow{%
                                    data-mrbrains13/model-DDM/unsup-False/src-T1/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                    data-mrbrains13/model-DDM/unsup-False/src-T2_FLAIR/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                    data-mrbrains13_aligned/model-AdaptSegNet/unsup-True/src-A_T2_FLAIR/targ-B_T1/lambda-\defadaptlambda/loss-None/shuf-False/eval-False,
                                                                        data-mrbrains13_aligned/model-DDM/unsup-True/src-A_T2_FLAIR/targ-B_T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-False,
                                                                        }[TARG_DC,SRC_DC]  \\
            iSEG, T1 $\to$ T2
                            \printSingleRow{%
                            data-iseg/model-DDM/unsup-False/src-T2/targ-T2/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                            data-iseg/model-DDM/unsup-False/src-T1/targ-T2/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                            data-iseg_aligned/model-AdaptSegNet/unsup-True/src-A_T1/targ-B_T2/lambda-\defadaptlambda/loss-None/shuf-False/eval-True,
                                                        data-iseg_aligned/model-DDM/unsup-True/src-A_T1/targ-B_T2/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                                                        }[TARG_DC,SRC_DC][][EVAL_]  \\
            iSEG, T2 $\to$ T1
                            \printSingleRow{%
                            data-iseg/model-DDM/unsup-False/src-T1/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                            data-iseg/model-DDM/unsup-False/src-T2/targ-T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                            data-iseg_aligned/model-AdaptSegNet/unsup-True/src-A_T2/targ-B_T1/lambda-\defadaptlambda/loss-None/shuf-False/eval-True,
                                                        data-iseg_aligned/model-DDM/unsup-True/src-A_T2/targ-B_T1/lambda-\deflambda/loss-\defloss/shuf-False/eval-True,
                                                        }[TARG_DC,SRC_DC][][EVAL_]  \\
      \bottomrule
    \end{tabular}
  }
  \label{tab:results:shuf_with_reg}
\end{table*}



  


{\bfseries{Training stability:}}
In addition to segmentation performance, we compared our method with the adversarial approach in terms of training stability. Fig.~\ref{fig:dice} depicts the evolution of the mean 3D Dice at test time for AdaptSegNet and our approach, evaluated every 5 epochs. On both datasets, training is very unstable for the adversarial approach.
As a consequence, its performance can differ drastically depending on the number of training epochs and the stopping criterion. In contrast, the proposed method is markedly more stable and converges smoothly during training.

\begin{figure}[h!]
  \centering
  \subfigure[\scriptsize{MRBrainS: T1$\to$T2-FLAIR}]{
    \begin{scaletikzpicturetowidth}{.4\textwidth}
      \scalebox{\tikzscale}{
        \begin{tikzpicture}
          \begin{axis}[ymin=0.0, ymax=0.85,
            grid=both, minor tick num=2,
            axis lines=center,
            legend style={at={(0.01,0.01)},anchor=south west},
            xlabel=epoch,
            x label style={anchor=north east}]
            \addCSVplot[blue, name path=A1, smooth]            {\defaultdir/data-mrbrains13/model-AdaptSegNet/unsup-True/src-T1/targ-T2_FLAIR/lambda-\defadaptlambda/loss-None/shuf-False/eval-False}{TARG_DC}{amin};
            \addlegendentry{AdaptSegNet};
            \addCSVplot[red, name path=A2, smooth]            {\defaultdir/data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\deflambda/loss-\defloss/shuf-False/eval-False}{TARG_DC}{amin};
            \addlegendentry{Proposed};
            \addCSVplot[blue, name path=B1,smooth]            {\defaultdir/data-mrbrains13/model-AdaptSegNet/unsup-True/src-T1/targ-T2_FLAIR/lambda-\defadaptlambda/loss-None/shuf-False/eval-False}{TARG_DC}{amax};
            \addCSVplot[red, name path=B2, smooth]            {\defaultdir/data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\deflambda/loss-\defloss/shuf-False/eval-False}{TARG_DC}{amax};
            \tikzfillbetween[of=A1 and B1]{blue, opacity=0.2};
            \tikzfillbetween[of=A2 and B2]{red, opacity=0.2};
          \end{axis}
        \end{tikzpicture}}
    \end{scaletikzpicturetowidth}
    \label{fig:dice-mrbrains}
  }
  \hfill
  \subfigure[\scriptsize{iSEG: T1$\to$T2}]{
    \begin{scaletikzpicturetowidth}{.4\textwidth}
      \scalebox{\tikzscale}{
        \begin{tikzpicture}
          \begin{axis}[ymin=0.25, ymax=0.85,
            grid=both, minor tick num=2,
            axis lines=center,
            xlabel=epoch,
            x label style={anchor=north east}]
            \addCSVplot[blue, name path=A1, smooth]            {\defaultdir/data-iseg/model-AdaptSegNet/unsup-True/src-T1/targ-T2/lambda-\defadaptlambda/loss-None/shuf-False/eval-True}{TARG_DC}{amin};
            \addlegendentry{AdaptSegNet};
            \addCSVplot[red, name path=A2, smooth]            {\defaultdir/data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\deflambda/loss-\defloss/shuf-False/eval-True}{TARG_DC}{amin};
            \addlegendentry{Proposed};
            \addCSVplot[blue, name path=B1,smooth]            {\defaultdir/data-iseg/model-AdaptSegNet/unsup-True/src-T1/targ-T2/lambda-\defadaptlambda/loss-None/shuf-False/eval-True}{TARG_DC}{amax};
            \addCSVplot[red, name path=B2, smooth]            {\defaultdir/data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\deflambda/loss-\defloss/shuf-False/eval-True}{TARG_DC}{amax};
            \tikzfillbetween[of=A1 and B1]{blue, opacity=0.2};
            \tikzfillbetween[of=A2 and B2]{red, opacity=0.2};
          \end{axis}
        \end{tikzpicture}}
    \end{scaletikzpicturetowidth}
    \label{fig:dice-iseg}
  }
  \caption{Evolution of the mean Dice coefficient over epochs. The minimum and maximum values observed over the three cross-validation runs are plotted, and the area in between is shaded.}
  \label{fig:dice}
\end{figure}







{\bfseries{Kernel choice:}} In addition to the \ac{kl} divergence, we conducted experiments on both datasets with the squared Euclidean distance $\mathcal D(\mathbf s, \mathbf s') = \|{\mathbf s} - {\mathbf s}'\|^2$ and the negative Bhattacharyya kernel $\mathcal D(\mathbf s, \mathbf s') = -\sqrt{{\mathbf s}^T {\mathbf s}'}$. As shown in Table~\ref{table:bhattacharyya}, the kernel choice has a negligible impact on performance.
\begin{table}[h!]
\small
\caption{Mean Dice coefficients in percent when training with different distance functions $\mathcal D$.}
\begin{tabular}{CCCZ}
\toprule
  $\mathcal D$ & Sq.\ Euclidean & Bhattacharyya & \ac{kl} divergence \\
  $\lambda$ & $\altdistlambda$ & $\altdistlambda$ & $\deflambda$ \\
      \midrule
  iSEG, T1 $\to$ T2
  \printSingleRow{%
  data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\altdistlambda/loss-mean_squared_error/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\altdistlambda/loss-negative_bhattacharyya_kernel/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\deflambda/loss-kullback_leibler_divergence/shuf-False/eval-True}[][][EVAL_]
  \\
    MRBrainS, T1 $\to$ T2
  \printSingleRow{%
  data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\altdistlambda/loss-mean_squared_error/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\altdistlambda/loss-negative_bhattacharyya_kernel/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\deflambda/loss-kullback_leibler_divergence/shuf-False/eval-False}
  \\
\bottomrule
\end{tabular}
\label{table:bhattacharyya}
\end{table}
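
To make the two alternative distances concrete, consider a purely illustrative two-class example. The vectors below are hypothetical and not taken from our experiments, and we assume that $\mathbf s$ and $\mathbf s'$ are probability vectors, here $\mathbf s = (0.5, 0.5)^T$ and $\mathbf s' = (0.9, 0.1)^T$. With the definitions given above, the squared Euclidean distance and the negative Bhattacharyya kernel evaluate to
\begin{align*}
  \|{\mathbf s} - {\mathbf s}'\|^2 &= (-0.4)^2 + (0.4)^2 = 0.32, \\
  -\sqrt{{\mathbf s}^T {\mathbf s}'} &= -\sqrt{0.45 + 0.05} \approx -0.71.
\end{align*}
Note that, for probability vectors, the squared Euclidean distance is nonnegative, whereas the negative Bhattacharyya kernel is nonpositive by construction.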

{\bfseries{Impact of parameter $\lambda$:}}
We experimented with different values of the parameter $\lambda$ to examine the sensitivity of the method to this choice.
The results are reported in Table~\ref{table:lambda}.

\begin{table*}[h!]
  \def\mylambdai{1.0}
  \def\mylambdaii{0.1}
  \def\mylambdaiii{0.05}
  \def\mylambdaiv{0.01}
  \def\mylambdav{0.005}
  \def\mylambdavi{0.001}
\small
\caption{Mean Dice coefficients in percent when training with different Lagrange parameters.}
\makebox[\textwidth][c]{
\begin{tabular}{CCCCCZ}
\toprule
  $\lambda$   & \mylambdaii
  & \mylambdaiii & \mylambdaiv  & \mylambdav & \mylambdavi \\
      \midrule
  iSEG, T1 $\to$ T2
  \printSingleRow{%
    data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\mylambdaii/loss-\defloss/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\mylambdaiii/loss-\defloss/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\mylambdaiv/loss-\defloss/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\mylambdav/loss-\defloss/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T1/targ-T2/lambda-\mylambdavi/loss-\defloss/shuf-False/eval-True}[][][EVAL_]
  \\
    iSEG, T2 $\to$ T1
  \printSingleRow{%
    data-iseg/model-DDM/unsup-True/src-T2/targ-T1/lambda-\mylambdaii/loss-\defloss/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T2/targ-T1/lambda-\mylambdaiii/loss-\defloss/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T2/targ-T1/lambda-\mylambdaiv/loss-\defloss/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T2/targ-T1/lambda-\mylambdav/loss-\defloss/shuf-False/eval-True,
  data-iseg/model-DDM/unsup-True/src-T2/targ-T1/lambda-\mylambdavi/loss-\defloss/shuf-False/eval-True}[][][EVAL_]
  \\
    MRBrainS, T1 $\to$ T2
  \printSingleRow{%
    data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\mylambdaii/loss-\defloss/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\mylambdaiii/loss-\defloss/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\mylambdaiv/loss-\defloss/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\mylambdav/loss-\defloss/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T1/targ-T2_FLAIR/lambda-\mylambdavi/loss-\defloss/shuf-False/eval-False}[]
  \\
    MRBrainS, T2 $\to$ T1
  \printSingleRow{%
    data-mrbrains13/model-DDM/unsup-True/src-T2_FLAIR/targ-T1/lambda-\mylambdaii/loss-\defloss/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T2_FLAIR/targ-T1/lambda-\mylambdaiii/loss-\defloss/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T2_FLAIR/targ-T1/lambda-\mylambdaiv/loss-\defloss/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T2_FLAIR/targ-T1/lambda-\mylambdav/loss-\defloss/shuf-False/eval-False,
  data-mrbrains13/model-DDM/unsup-True/src-T2_FLAIR/targ-T1/lambda-\mylambdavi/loss-\defloss/shuf-False/eval-False}[]
  \\
\bottomrule
\end{tabular}
}
\label{table:lambda}
\end{table*}









\section{Conclusions}

In this paper, we proposed a direct distribution matching approach for \ac{uda} in the context of semantic segmentation of medical images. Unlike adversarial approaches, our method matches the distributions from both domains with a single network, avoiding complex and unstable adversarial steps. It also leverages the contextual similarities of the output (label) spaces corresponding to pairs of images that come from different modalities but depict the same structures, up to some geometric transformations, as is very common in medical imaging. This property is specific to multi-modal medical images, does not hold for natural images, and provides a very important structural prior for \ac{uda}. Adversarial approaches have no mechanism to account for such a prior.
As demonstrated in our experiments, directly matching output distributions has two benefits compared to adversarial learning: significantly superior performance and better training stability.

\section*{Acknowledgments}
\label{sec:acknowledgments}

Dr.\ Georg Pichler and Prof.\ Pablo Piantanida would like to acknowledge support for this project from the CNRS via the International Associated Laboratory (LIA) on Information, Learning and Control. The work of Prof.\ Pablo Piantanida was supported by the European Commission’s Marie Sklodowska-Curie Actions (MSCA), through the Marie Sklodowska-Curie IF (H2020-MSCAIF-2017-EF-797805-STRUDEL). 

Prof.\ Jose Dolz would like to thank NVIDIA for the donation of one TITAN V to support his research.

Some computations were made on the supercomputer ``Helios'' from Laval University, managed by Calcul Québec and Compute Canada. The operation of this supercomputer is funded by the Canada Foundation for Innovation (CFI), the ministère de l'Économie, de la science et de l'innovation du Québec (MESI) and the Fonds de recherche du Québec - Nature et technologies (FRQ-NT).

Part of this work was performed using HPC resources from the \href{http://mesocentre.centralesupelec.fr/}{Mésocentre} computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay supported by CNRS and Région Île-de-France.

\myappendix

{\small
\bibliography{pichler20}
}

\end{document}

