%\section{Results}
\subsubsection{Data.}We validate the proposed framework on two widely studied medical datasets:
\begin{itemize}  
\item[$\boldsymbol{\cdot}$] Multimodal Brain Tumor Image Segmentation (BraTS) \cite{Menze2015TheMB} 
\item[$\boldsymbol{\cdot}$] International Skin Imaging Collaboration: Melanoma Project (ISIC)   \cite{Kuijf2019StandardizedAO}
\end{itemize} 

The BraTS2019 dataset consists of MRI scans of glioma patients, manually segmented by an expert board. The full set comprises 335 MRI volumes, of which 259 are from high-grade glioma patients and 76 from low-grade glioma patients (we extracted 16.5K 2D slices from the 3D volumes for our task). For each patient, scans from four imaging modalities are available: T1-weighted, contrast-enhanced T1-weighted (T1Gd), T2-weighted, and T2 Fluid Attenuated Inversion Recovery (FLAIR). The image annotations cover the GD-enhancing tumor, the peritumoral edema, and the necrotic and non-enhancing tumor core. The dataset was collected at multiple medical centers, and we observe notable differences in the appearance of the scans across centers. We therefore use the center ID available from the naming of the data (e.g. CBICA, TMC) as the class conditioning.

The ISIC2018 dataset contains over 13000 dermoscopic images of skin lesions. We used a subset of $\sim$2600 images in which clinical experts outlined the lesion area and which also includes metadata such as the lesion type. For the class conditioning, we used the skin lesion type, e.g. Melanoma, Seborrheic Keratosis, Nevus.

As depicted in Fig.~3 (upper row), both datasets exhibit class imbalance, which we aim to eliminate by augmenting the datasets with class-specific synthetic samples.

\subsubsection{Experiments.}First, we quantify the effect of including the third player by training a 2D U-Net only on synthetic images produced by either the original or the proposed GAN and testing on real data from the original dataset (Tab.~\ref{table:1}). This probes the quality of the synthetic set. We observe an improvement in accuracy (multi-class Dice score) for the proposed SPADE architecture with feature-level discrimination over the original SPADE. In what follows, we use the proposed version. Examples of the synthesized images can be found in the supplementary material.

\begin{table}[H]
\centering
  \begin{tabular}{ |p{3cm}|p{3cm}|p{3cm}|  }
  \hline
   & Original SPADE & Proposed SPADE \\
  \hline
  BraTS & 0.6479 & 0.6598 \\ 
  \hline
  ISIC & 0.5936 & 0.6169 \\ 
  \hline
\end{tabular}

\caption{Dice scores for segmentation using only synthetic images produced by the original and proposed GANs. For the GAN training we used 90\% of the total number of 2D slices/images. Testing is performed on 10\% of the real data from the original dataset. For the synthesis we used the masks from the training set. The mean results are obtained via 3-fold cross-validation.} \label{table:1}
\end{table}
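
For reference, the Dice scores in Tab.~\ref{table:1} can be read as follows. This is a sketch of the standard definition; the notation ($P_k$, $G_k$, $K$) is introduced here for illustration, and the unweighted average over labels is an assumption, since the averaging scheme is not stated above. For label $k$ with predicted mask $P_k$ and reference mask $G_k$,
\[
  \mathrm{Dice}_k = \frac{2\,|P_k \cap G_k|}{|P_k| + |G_k|},
  \qquad
  \mathrm{Dice} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{Dice}_k ,
\]
where $K$ is the number of foreground labels. In absolute terms, the proposed SPADE improves the score by $0.6598 - 0.6479 = 0.0119$ on BraTS and by $0.6169 - 0.5936 = 0.0233$ on ISIC.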

Next, we perform two series of experiments in which we train a 2D U-Net on the original dataset augmented with synthetic samples according to the following strategies:\\

\noindent I. \textbf{Single class augmentation.} We inject synthetic images generated for one particular class from all masks in the training set except those belonging to that class. \\
II. \textbf{Balanced augmentation.} For every class, we separately synthesize images as in (I) and inject all of them into the dataset. \\

The first strategy biases the class distribution toward the synthesized class, making its number of samples equal to the size of the whole original dataset. This strategy allows us to probe how specific the synthetic images are to the class on which they were conditioned. The second strategy balances the distribution by increasing the size of each class to the original dataset size. In Fig.~3, we compare both strategies with a baseline trained without GAN-based augmentation. With strategy (I), the strongest accuracy increase over the baseline is achieved for the injected class (5\% for the BraTS ``TCIA05'' class and 4\% for the ISIC ``Keratosis'' class). For the other classes, the Dice score increases less or even decreases. This suggests that the generated images possess the desired property of being specific to the conditioned class. Plots for the other single-class injection experiments are provided in the supplementary material.
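
The class sizes implied by the two strategies can be written out explicitly. As a sketch, in notation introduced here for illustration only, let $N$ be the number of real training samples, $n_c$ the number belonging to class $c$, and $C$ the number of classes, and assume that exactly one synthetic image is generated per eligible mask. Strategy (I) for class $c$ adds $N - n_c$ synthetic images, so that
\[
  n_c^{\mathrm{(I)}} = n_c + (N - n_c) = N,
  \qquad
  N^{\mathrm{(I)}} = N + (N - n_c) = 2N - n_c ,
\]
while all other classes keep their original sizes. Strategy (II) applies this to every class, giving
\[
  n_c^{\mathrm{(II)}} = N \quad \forall c,
  \qquad
  N^{\mathrm{(II)}} = N + \sum_{c=1}^{C} (N - n_c) = CN ,
\]
i.e. $(C-1)N$ synthetic images are added in total and every class has the same size.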

As depicted in the bottom row of Fig.~3, balanced augmentation (strategy II) increases the Dice score for most of the sparse classes (up to 5\% for BraTS and up to 2\% for ISIC). We note that the
difference in visual appearance between various skin lesions is clearly greater than the difference between MRI images of the same brain tumor lesion acquired in different acquisition environments. Learning the mask-to-image mapping conditioned on the lesion type is therefore a more difficult task than learning the mapping conditioned on the center ID. This explains the poorer performance of the method on the ISIC dataset compared to BraTS.

%Rather poor increase for the Melanoma lesion (ISIC) can be explained by its very similar appearance to the dominant Melanoma class. %We note that due to 8 times smaller size of the ISIC dataset compared to BraTS the quality of the generated images is inferior. Despite of this, this quality is sufficient for changing the accuracy levels.




















%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



\begin{comment}

To establish a baseline for the segmentation results, we train the 2D U-Net on the original images and on the original images augmented through traditional geometric transformations. We test on data from the original dataset. Tab.~\ref{table:2} shows that even augmentation through geometric transformations can give a significant boost in accuracy.

\begin{table}[H]
 \centering
 \begin{tabular}{ |p{3cm}|p{3cm}| }
 \hline
 Original Images & Augmented images \\
 \hline\hline
 0.7554 (0.0984) & 0.8243 (0.0893) \\ 
 \hline
 \end{tabular}
 \centering
 
 \caption{Dice scores for segmentation of the BraTS dataset using only original images without any augmentation and with augmentations based on geometric transformations, e.g. rotation, flipping. The mean and standard deviation (in brackets) are obtained via 3-fold cross-validation.} \label{table:2}
\end{table}

Lastly, using the proposed GAN, we generate synthetic images under different conditioning strategies and add them to the training images from the original dataset. Tab.~\ref{table:3} shows the results for training a U-Net using the GAN-based augmentations. We add different proportions of synthetic images relative to the size of the original dataset. The accuracy gains plateau at 3x the size of the original dataset. The conditioning strategies applied to the masks are elastic deformation (ED), co-registration (CR), and a hybrid of co-registration with geometric transformations (CR + Augmented).

\begin{table}[H]
\centering
  \begin{tabular}{ |p{2.2cm}||p{2.2cm}|p{2.2cm}|p{2.2cm}|p{2.2cm}|p{2.2cm}|  }
  \hline
  Increase of original set by synthetic images & Synthetic (Original Masks) & Synthetic (ED) & Synthetic (CR) & Synthetic (CR) + Augmented \\
  \hline\hline
  1.5x & 0.7434 (0.1353) & 0.7518 (0.1001) & 0.7658 (0.1247) & 0.8245 (0.0751)\\ 
  %\hline
  %100\% & 0.7567 & 0.7832 & 0.7739 & 0.8281 \\
  \hline
  3x & - & 0.7683 (0.1058) & 0.7823 (0.0964) & 0.8341 (0.0759) \\
  \hline
  \end{tabular}
  
  \caption{Dice scores for segmentation of the BraTS dataset using GAN-based augmentations. The mean and standard deviation (in brackets) are obtained via 3-fold cross-validation.} \label{table:3}
\end{table}

Tab.~\ref{table:4} shows the class-specific Dice scores for the different augmentation strategies.

\begin{table}[H]
 \begin{tabular}{ |p{1.92cm}||p{1.92cm}|p{1.92cm}|p{1.92cm}|p{1.92cm}|p{1.92cm}|p{1.92cm}| }
 \hline
 Class & Original Images & Synthetic (ED), 3x & Synthetic (CR), 3x & Augmented Images & Synthetic (CR) + Augmented, 3x \\
 \hline\hline
 WT & 0.7554 (0.0984) & 0.7683 (0.1058) & 0.7823 (0.0964) & 0.8243 (0.0893) & 0.8341 (0.0759) \\ 
 \hline
 TC & 0.5880 (0.2731) & 0.6395 (0.2474) & 0.6437 (0.2478) & 0.7319 (0.2234) & 0.7360 (0.2401)  \\
 \hline
 ET & 0.6902 (0.2613) & 0.7042 (0.2338) & 0.6958 (0.2464) & 0.7645 (0.2157) & 0.7644 (0.2216) \\
 \hline
\end{tabular}

\caption{Class-specific Dice scores for the BraTS dataset. The mean and standard deviation (in brackets) are obtained via 3-fold cross-validation.} \label{table:4}
\end{table}

We performed a Wilcoxon signed-rank test on the pairwise Dice scores obtained with the Augmented Images and the Synthetic (CR) + Augmented Images strategies. For the WT class, the test statistic is 65938.0 (p-value: 0.0149). For ET, it is 33273.5 (p-value: 0.6120), and for the TC class it is 33534.0 (p-value: 0.0688).

\end{comment}
