% \vspace{-3pt}
\vspace{-10pt}
\section{Experiments}
% \vspace{-1pt}
\vspace{-2pt}
\subsection{Experiment setup} \label{exp:setup}
% \vspace{-1pt}
\vspace{-2pt}

In our experiments, we emulate a distributed setting with $N$ sites. As per \sectionref{met:overview}, for each site $S_i$, GAN-$S_i$ and U-Net-\textit{real-}$S_i$ are trained and used to generate a synthetic dataset. U-Net-\textit{syn-all} is trained on the merged synthetic datasets and fine-tuned at each site to obtain U-Net-\textit{syn-real-}$S_i$. U-Net-\textit{real-all} is trained on the merged real data as a baseline. All experiments are performed with five-fold cross-validation (three folds for training, one for validation, and one for testing). We report the mean and the standard deviation of the Dice Score (DS) and the 95th percentile of the Hausdorff Distance (HD95) across the folds. All evaluations are performed on real data. The results of statistical testing are given in Appendix~\ref{app:stats}. Appendix~\ref{app:fedlearn} contains comparisons with federated learning baselines. The ablations of pretraining with single-site synthetic data and of using the standard StyleGAN2 architecture are reported in Appendices~\ref{app:syn-local} and~\ref{app:square-gan}.
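
For reference, the two metrics can be written as follows, assuming the standard definitions (the exact implementation is not specified in this section, and HD95 conventions differ between toolkits). Here $P$ and $G$ denote the predicted and ground-truth masks, and $\partial P$ and $\partial G$ their sets of surface points; these symbols are introduced only for this illustration.
\[
\mathrm{DS}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|},
\]
\[
\mathrm{HD95}(P, G) = \max\Bigl\{ \mathop{\mathrm{P}_{95}}_{p \in \partial P} \min_{g \in \partial G} \|p - g\|_2,\; \mathop{\mathrm{P}_{95}}_{g \in \partial G} \min_{p \in \partial P} \|g - p\|_2 \Bigr\},
\]
where $\mathrm{P}_{95}$ is the 95th percentile over the indicated set. Higher DS and lower HD95 indicate better segmentation. The magnitudes of the DS differences quoted below are consistent with DS being reported as a percentage.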

\subsection{Datasets} \label{exp:data}
\vspace{-4pt}
\textbf{Cervix}: a private dataset from the Leiden University Medical Center consisting of T2-weighted MRI scans of 185 cervical cancer patients who underwent brachytherapy, with 4 organs-at-risk (bladder, bowel, rectum, sigmoid) delineated. The dataset was split into two sites based on the scanner to emulate a non-i.i.d. data distribution: site A (Philips Ingenia 1.5T, 128~patients) and site B (Philips Intera, 36~patients; Ingenia 3T, 13; Achieva, 8). The median resolution is $37\times432\times432$ voxels, and the median spacing is $4\times0.53\times0.53$ mm.

\vspace{2pt}

\noindent \textbf{Lung}: QaTa-COV19~\cite{degerli_osegnet_2022}, a dataset of COVID-19 chest X-ray images and binary segmentation masks of pneumonia. We use 6,307 images for which an anonymized patient ID is provided (2,130 patients) and randomly split patients into 2 or 8 sites. The median resolution is $224\times224$ pixels.

\vspace{2pt}

\noindent \textbf{Polyp}: polyp photos with binary segmentation masks of polyps. Site A contains data from HyperKvasir~\cite{borgli_hyperkvasir_2020} (1,000 images, median resolution $530\times621$ pixels), and site B contains data from CVC-ClinicDB~\cite{bernal_wm-dova_2015} (612 images, $384\times288$ pixels).

\vspace{-5pt}
\subsection{Results}
\vspace{-2pt}
\subsubsection{Synthetic data sharing leads to improved segmentation quality}
\vspace{-2pt}
\figureref{fig:cervix} shows that in the experiments with \textbf{Cervix}, pretraining on merged synthetic data improves the DS and HD95 metrics in most cases. The performance on site A is approximately the same across the \emph{real-A}, \emph{syn-real-A}, and \emph{real-all} settings, showing that adding data from site B helps little, even when it is real data. This is likely due to the large number and uniformity of patients in site A itself. Nonetheless, the performance of \emph{syn-real-A} on site B improves compared to \emph{real-A} (by 3.3 DS and 4.3 HD95 on average), showing that pretraining on merged synthetic data improves the robustness of the model to data shifts. For site B, the pretrained and fine-tuned model (\emph{syn-real-B}) outperforms its counterpart trained exclusively on the local data (\emph{real-B}) when tested on both A (by 6.3 DS and 4.3 HD95 on average) and B (by 2.2 DS and 2.5 HD95 on average). The full results for all datasets are given in Appendix~\ref{app:full}.

\begin{figure}
    \centering
    \includegraphics[width=0.34\linewidth]{_figures/4_experiments/cervix_errorbar_dice.pdf}%
    \hspace{.05\textwidth}%
    \includegraphics[width=0.34\linewidth]{_figures/4_experiments/cervix_errorbar_hd95.pdf}
    \vspace{-6pt}
    \caption{Results for the \textbf{Cervix} data: U-Nets trained with the settings specified in \sectionref{exp:setup} and evaluated on sites A and B (5 folds).}
    \label{fig:cervix}
    \vspace{-25pt}
\end{figure}

\figureref{fig:qata_and_hyperkvasir} shows that switching from the model trained only on local data (\emph{real}) to the one pretrained on the merged synthetic data and fine-tuned on local data (\emph{syn-real}) improves DS by 0.8 on average in the \textbf{Lung} setting. For \textbf{Polyp}, the improvement for the target site is minor (0.5 DS on average), but the improvement in robustness to data shifts, as measured by the performance on the other site, is large (on average, 2.7 DS for A and 13.4 DS for B).

While training on real data centrally (\textit{real-all}) gives the best results overall, our method comes close (the largest difference is 2.0 DS in \textbf{Polyp}-B) without requiring real data sharing.

\begin{figure}[ht]
    \centering
    \begin{minipage}{.6\textwidth} %.632
        \centering
        \begin{minipage}{.5\textwidth}
            \includegraphics[width=\linewidth]{_figures/4_experiments/qata_errorbar_dice.pdf}
            \label{fig:lung_dice}
        \end{minipage}%
        \begin{minipage}{.5\textwidth}
            \includegraphics[width=\linewidth]{_figures/4_experiments/polyp_errorbar_dice.pdf}
            \label{fig:placeholder}
        \end{minipage}
        \caption{DS for the \textbf{Lung} (left) and \textbf{Polyp} (right) data: U-Nets trained in the settings specified in \sectionref{exp:setup} and evaluated on sites A and B (5 folds).}
        \label{fig:qata_and_hyperkvasir}
    \end{minipage}%
    \hspace{.08\textwidth}%
    \begin{minipage}{.316\textwidth} %.316 %.372
        \centering
        \includegraphics[width=.93\linewidth]{_figures/4_experiments/qata_scaling_dice.pdf}
        \caption{DS improvement of \emph{syn-real} over \emph{real} as more sites are added (\textbf{Lung}).}
        \label{fig:deltas}
        
    \end{minipage}
    \vspace{-15pt}
\end{figure}

\subsubsection{Benefits of synthetic data sharing increase with more sites}
\vspace{-1pt}

The scaling behavior of HyFree-S3 is investigated in the \textbf{Lung} setting, with the data split into 8 sites. The general segmentation model is pretrained with the synthetic data from 2, 4, or 8 sites, and then fine-tuned at each site. The average difference in DS between \emph{real} (training with local data only) and \emph{syn-real}, shown in Figure~\ref{fig:deltas}, becomes larger as the number of sites grows, showing that the method is scalable and enables larger improvements when more sites join.

\vspace{-5pt}
\subsubsection{Memorization is not observed} \label{exp:mem}
\vspace{-1pt}

Per \sectionref{met:mem}, we use OpenCLIP embeddings to compare either subsets of real images, or real images and synthetic images. As a sanity check, we established that an image present in two sets will be its own nearest neighbor, and that a mirrored image will most often be the nearest neighbor of the original image (in 97.9\% of cases for \textbf{Lung}).

Figure~\ref{fig:memorization} (top right) shows a histogram of distances to the nearest neighbor when comparing two subsets of real images (from a \textbf{Lung} experiment), together with the 5$^\mathrm{th}$ percentile of these distances, which is used as a threshold to filter synthetic images. The histogram of distances between real and synthetic images in Figure~\ref{fig:memorization} (bottom right) has a similar shape but is shifted to the right. As a result, all synthetic images have a distance above the threshold, meaning that while synthetic images are generally similar to real images, they are not too similar.
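
The thresholding rule can be sketched as follows, with notation introduced here for illustration only: $\phi$ is the OpenCLIP embedding, $d$ is a distance in the embedding space (the specific metric is not stated in this section), $R_1$ and $R_2$ are the two subsets of real images, $R$ is the set of real images compared against the synthetic ones, and $\mathrm{P}_5$ is the 5$^\mathrm{th}$ percentile.
\[
\tau = \mathrm{P}_5\Bigl(\bigl\{\min_{r' \in R_2} d\bigl(\phi(r), \phi(r')\bigr) : r \in R_1\bigr\}\Bigr), \qquad
\mbox{discard synthetic } s \mbox{ if } \min_{r \in R} d\bigl(\phi(s), \phi(r)\bigr) < \tau .
\]
Whether the inequality is strict and how $R_1$, $R_2$, and $R$ are drawn are assumptions of this sketch, not details confirmed here.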

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{_figures/4_experiments/fig_memorization5.png}
    \caption{\textbf{Left:} real images that are the closest to any synthetic image and the two closest synthetic images (including the distance to the real image). \textbf{Right:} a distribution of distances to the nearest neighbor for \textit{(top)} two subsets of real images or \textit{(bottom)} synthetic and real images, and the 5$^{\mathrm{th}}$ percentile of the real-to-real distances.}
    \label{fig:memorization}
    \vspace{-10pt}
\end{figure}

We visualize adversarially chosen real and synthetic images in Figure~\ref{fig:memorization} (left). We select two synthetic images (column 2) with the smallest nearest neighbor distances to some real images (column 1), and the synthetic images that have the second-smallest distances to these real images (column 3). The images are broadly similar but clearly distinct and do not demonstrate memorization. Note that our memorization analysis relies on OpenCLIP embeddings; see Appendix~\ref{app:memo} for further discussion of memorization.

For \textbf{Lung} and \textbf{Polyp}, only 6 synthetic images (out of $3\times10^5$) are discarded as too similar to real images. For \textbf{Cervix}, 1,180 images (out of $2.6\times10^5$) are discarded; some of them look very similar to their real nearest neighbors, but none is an exact duplicate, which is consistent with the nature of 3D data, where nearby slices are similar. See Appendix~\ref{app:viz_cmp_syn_real} for visual comparisons between real and synthetic images. That only $0.2\%$ of synthetic images are discarded suggests that our GANs typically do not memorize images; the proposed filtering scheme can nonetheless help to avoid sharing potentially memorized data.
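
As a check on the $0.2\%$ figure, assuming it pools the counts reported above over all three datasets:
\[
\frac{6 + 1{,}180}{3\times10^5 + 2.6\times10^5} = \frac{1{,}186}{5.6\times10^5} \approx 0.0021 \approx 0.2\%.
\]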