\appendix

\section{Statistical testing} \label{app:stats}

Table~\ref{tab:pvalues} lists p-values for one-sided Wilcoxon pairwise rank tests~\cite{wilcoxon_individual_1992} with Bonferroni correction~\cite{dunn_multiple_1961}. For each dataset, we compare the metrics of \textit{syn-real} (pretraining on merged synthetic data followed by site-specific fine-tuning on real data) against those of \textit{real} (training on local real data only), with the null hypothesis that \textit{syn-real} performs worse. The results of \textit{syn-real-A} (evaluated on A, B) and \textit{syn-real-B} (evaluated on A, B) across 5 folds are pooled to increase the sample size (and similarly for \textit{real-A} and \textit{real-B}), giving a total sample size of 20 for each test. P-values below the significance threshold of 0.0025 are highlighted (target p-value$=0.01$, 4 tests, corrected $p=0.0025$). The null hypothesis is rejected in all cases, so we conclude that \textit{syn-real} performs statistically significantly better than \textit{real}.
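
As a worked check of the threshold, assuming the standard Bonferroni rule of dividing the target significance level by the number of tests (the symbols $\alpha$ and $m$ are introduced here only for illustration):
\[
  \alpha_{\mathrm{corrected}} = \frac{\alpha}{m} = \frac{0.01}{4} = 0.0025 .
\]
The largest p-value in Table~\ref{tab:pvalues} is $0.0007$, which is below this threshold.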

\begin{table}[h!]
\centering
  \caption{P-values of conducted experiments (rounded to four decimal places).}
  \label{tab:pvalues}
  \begin{tabular}{ccl}
    \toprule
    Dataset & Metric & P-value\\
    \midrule
    \textbf{Cervix} & DS & \cellcolor{mycolor} 0.0006 \\
    \textbf{Cervix} & HD95 & \cellcolor{mycolor} 0.0007 \\
    \textbf{Lung} & DS & \cellcolor{mycolor} 0.0002 \\
    \textbf{Polyp} & DS & \cellcolor{mycolor} 0.0002 \\
  \bottomrule
\end{tabular}
\end{table}

\section{Full results} \label{app:full}

Tables~\ref{tab:metrics_cervix}, \ref{tab:metrics_lung}, \ref{tab:metrics_polyp} give full results for our \textbf{Cervix}, \textbf{Lung}, \textbf{Polyp} experiments. For \textbf{Cervix} and \textbf{Lung}, U-Net-syn-$S_i$ are additionally trained on local synthetic datasets to compare with U-Net-syn-all. As expected, they perform worse.

Table~\ref{tab:cervix_per_organ} gives per-organ performance for the \textbf{Cervix} dataset. Figure~\ref{fig:cmp_slices} contains slice predictions for a qualitative comparison of a \textit{real} and a \textit{syn-real} model.

\begin{table}[h!]
    \fontsize{8pt}{8pt}\selectfont
    \centering
    \caption{Results for the \textbf{Cervix} data. Each column corresponds to a U-Net trained in the specified setting (see~\sectionref{exp:setup}) and evaluated on sites A and B. Mean $\pm$ st. dev. are reported for 5 folds.}
    \begin{NiceTabular}{@{}l|@{}c@{\hspace{2pt}}|cc|ccc|ccc@{}} \toprule
    \Block{2-1}{Metric} & \Block{2-1}{\hspace{2pt}Test\\\hspace{1pt}site} & \Block{1-8}{Training setting} \\
    \cmidrule{3-10}
    & & real-all & syn-all & real A & syn A & syn-real A & real B & syn B & syn-real B \\\midrule
    \Block{2-1}{DS} & A & $80.7 \pm 1.2$ & $79.4 \pm 1.3$ & $80.1 \pm 1.9$ & $79.0 \pm 1.5$ & $80.2 \pm 1.7$ & $72.2 \pm 3.9$ & $72.4 \pm 3.8$ & $78.5 \pm 2.1$ \\
    & B & $77.6 \pm 2.2$ & $75.4 \pm 2.6$ & $70.8 \pm 2.5$  & $69.9 \pm 1.5$ & $74.1 \pm 1.8$ & $73.7 \pm 4.4$ & $72.8 \pm 3.9$ & $75.9 \pm 1.6$ \\\midrule
    \Block{2-1}{HD95} & A & $13.4 \pm 0.5$ & $14.8 \pm 1.1$ & $14.2 \pm 0.9$ & $15.3 \pm 1.2$ & $14.3 \pm 0.5$ & $19.4 \pm 3.3$ & $19.9 \pm 3.1$ & $15.1 \pm 2.1$ \\
    & B & $16.0 \pm 1.0$ & $16.1 \pm 2.7$ & $21.4 \pm 4.3$ & $21.0 \pm 3.9$ & $17.1 \pm 1.1$ & $18.7 \pm 2.8$ & $19.0 \pm 3.0$ & $16.2 \pm 2.4$ \\
    
    \bottomrule
    
    \end{NiceTabular}
    \label{tab:metrics_cervix}
\end{table}

\begin{table}[h!]
    \fontsize{8pt}{8pt}\selectfont
    \centering
    \caption{Results for the \textbf{Lung} data. Each column corresponds to a U-Net trained in the specified setting (see~\sectionref{exp:setup}) and evaluated on sites A and B. Mean $\pm$ st. dev. are reported for 5 folds.}
    \begin{NiceTabular}{@{}l|@{}c@{\hspace{2pt}}|cc|ccc|ccc@{}} \toprule
    \Block{2-1}{Metric} & \Block{2-1}{\hspace{2pt}Test\\\hspace{1pt}site} & \Block{1-8}{Training setting} \\
    \cmidrule{3-10}
    & & real-all & syn-all & real A & syn A & syn-real A & real B & syn B & syn-real B \\\midrule
    \Block{2-1}{DS} & A & $77.2 \pm 0.9$ & $75.8 \pm 1.1$ & $75.9 \pm 1.0$ & $74.7 \pm 1.3$ & $76.5 \pm 1.6$ & $75.6 \pm 1.3$ & $74.6 \pm 1.5$ & $76.3 \pm 1.7$ \\
    & B & $78.2 \pm 1.6$ & $76.7 \pm 1.3$ & $76.7 \pm 1.4$ & $75.3 \pm 1.7$ & $77.5 \pm 1.2$ & $76.2 \pm 1.5$ & $75.0 \pm 1.6$ & $77.3 \pm 1.4$ \\
    \bottomrule
    
    \end{NiceTabular}
    \label{tab:metrics_lung}
\end{table}

\begin{table}[h!]
    \fontsize{8pt}{8pt}\selectfont
    \centering
    \caption{Results for the \textbf{Polyp} data. Each column corresponds to a U-Net trained in the specified setting (see~\sectionref{exp:setup}) and evaluated on sites A and B. Mean $\pm$ st. dev. are reported for 5 folds.}
    \begin{NiceTabular}{@{}l|@{}c@{\hspace{2pt}}|cc|cc|cc@{}} \toprule
    \Block{2-1}{Metric} & \Block{2-1}{\hspace{2pt}Test\\\hspace{1pt}site} & \Block{1-6}{Training setting} \\
    \cmidrule{3-8}
    & & real-all & syn-all & real A & syn-real A & real B & syn-real B \\\midrule
    \Block{2-1}{DS} & A & $90.0 \pm 1.4$ & $87.7 \pm 1.4$ & $89.7 \pm 1.4$ & $90.3 \pm 1.4$ & $69.3 \pm 2.5$ & $82.7 \pm 3.4$ \\
    & B & $84.2 \pm 5.1$ & $80.9 \pm 5.7$ & $78.4 \pm 8.2$ & $81.1 \pm 7.4$ & $81.8 \pm 4.6$ & $82.2 \pm 3.9$ \\
    \bottomrule
    
    \end{NiceTabular}
    \label{tab:metrics_polyp}
\end{table}

\begin{table}[h!]
    % \fontsize{8pt}{10pt}\selectfont
    \centering
    \begin{NiceTabular}{l|cc|cc|cc|cc|cc} \toprule
    \Block{2-1}{Model} & \Block{1-2}{Avg.} & & \Block{1-2}{Bladder} & & \Block{1-2}{Bowel} && \Block{1-2}{Rectum} && \Block{1-2}{Sigmoid}\\
    \cmidrule{2-11}& A & B & A & B & A & B & A & B & A & B \\\hline\addlinespace[4pt]
    \Block{1-11}{Dice Score} \\\hline
real-A & $80.1$ & $70.8$ & $96.0$ & $91.9$ & $68.8$ & $56.0$ & $81.7$ & $76.8$ & $73.7$ & $58.7$ \\
  & $\pm 1.9$ & $\pm 2.5$ & $\pm 0.4$ & $\pm 2.5$ & $\pm 3.5$ & $\pm 6.4$ & $\pm 2.9$ & $\pm 3.3$ & $\pm 2.5$ & $\pm 3.4$ \\\addlinespace[2pt]
syn-A & $79.0$ & $69.9$ & $96.0$ & $91.7$ & $66.6$ & $53.2$ & $81.7$ & $76.5$ & $71.7$ & $58.2$ \\
  & $\pm 1.5$ & $\pm 1.5$ & $\pm 0.4$ & $\pm 2.5$ & $\pm 2.8$ & $\pm 4.9$ & $\pm 2.8$ & $\pm 2.8$ & $\pm 2.0$ & $\pm 2.0$ \\\addlinespace[2pt]
syn-real-A & $80.2$ & $74.1$ & $96.0$ & $93.3$ & $69.6$ & $61.4$ & $81.5$ & $79.3$ & $73.5$ & $62.6$ \\
  & $\pm 1.7$ & $\pm 1.8$ & $\pm 0.4$ & $\pm 1.3$ & $\pm 2.8$ & $\pm 6.3$ & $\pm 2.8$ & $\pm 3.6$ & $\pm 2.0$ & $\pm 3.1$ \\\addlinespace[2pt]
real-B & $72.2$ & $73.7$ & $95.3$ & $94.1$ & $53.2$ & $60.3$ & $77.4$ & $79.1$ & $62.8$ & $61.5$ \\
  & $\pm 3.9$ & $\pm 4.4$ & $\pm 0.6$ & $\pm 0.9$ & $\pm 8.5$ & $\pm 9.9$ & $\pm 3.1$ & $\pm 3.2$ & $\pm 6.4$ & $\pm 7.0$ \\\addlinespace[2pt]
syn-B & $72.4$ & $72.8$ & $95.4$ & $93.9$ & $53.4$ & $57.7$ & $77.5$ & $78.8$ & $63.3$ & $60.8$ \\
  & $\pm 3.8$ & $\pm 3.9$ & $\pm 0.5$ & $\pm 1.2$ & $\pm 8.9$ & $\pm 11.0$ & $\pm 2.8$ & $\pm 3.5$ & $\pm 6.5$ & $\pm 3.8$ \\\addlinespace[2pt]
syn-real-B & $78.5$ & $75.9$ & $95.8$ & $93.9$ & $65.6$ & $63.4$ & $81.5$ & $80.5$ & $71.0$ & $66.0$ \\
  & $\pm 2.1$ & $\pm 1.6$ & $\pm 0.4$ & $\pm 1.2$ & $\pm 3.9$ & $\pm 5.6$ & $\pm 2.7$ & $\pm 3.3$ & $\pm 3.7$ & $\pm 3.7$ \\\addlinespace[2pt]
real-all & $80.7$ & $77.6$ & $96.2$ & $94.4$ & $70.1$ & $65.5$ & $81.8$ & $82.4$ & $74.7$ & $68.0$ \\
  & $\pm 1.2$ & $\pm 2.2$ & $\pm 0.4$ & $\pm 1.0$ & $\pm 2.1$ & $\pm 6.5$ & $\pm 2.6$ & $\pm 2.5$ & $\pm 1.6$ & $\pm 3.0$ \\\addlinespace[2pt]
syn-all & $79.4$ & $75.4$ & $96.1$ & $94.0$ & $67.5$ & $61.9$ & $82.0$ & $81.0$ & $72.1$ & $64.9$ \\
  & $\pm 1.3$ & $\pm 2.6$ & $\pm 0.3$ & $\pm 1.2$ & $\pm 2.2$ & $\pm 7.9$ & $\pm 2.8$ & $\pm 3.2$ & $\pm 2.0$ & $\pm 3.8$ \\
    \hline\addlinespace[4pt]
    \Block{1-11}{Hausdorff Distance (95th percentile)} \\\hline
real-A & $14.2$ & $21.4$ & $3.6$ & $11.7$ & $20.3$ & $25.5$ & $13.5$ & $16.4$ & $19.6$ & $32.1$ \\
  & $\pm 0.9$ & $\pm 4.3$ & $\pm 1.0$ & $\pm 8.4$ & $\pm 1.8$ & $\pm 2.7$ & $\pm 2.5$ & $\pm 4.9$ & $\pm 3.3$ & $\pm 7.8$ \\\addlinespace[2pt]
syn-A & $15.3$ & $21.0$ & $3.8$ & $13.9$ & $22.0$ & $26.6$ & $13.6$ & $14.1$ & $21.9$ & $29.2$ \\
  & $\pm 1.2$ & $\pm 3.9$ & $\pm 1.0$ & $\pm 9.9$ & $\pm 3.4$ & $\pm 4.8$ & $\pm 2.1$ & $\pm 3.2$ & $\pm 3.8$ & $\pm 6.7$ \\\addlinespace[2pt]
syn-real-A & $14.3$ & $17.1$ & $3.7$ & $7.9$ & $18.7$ & $20.6$ & $13.8$ & $12.7$ & $20.9$ & $27.1$ \\
  & $\pm 0.5$ & $\pm 1.1$ & $\pm 1.3$ & $\pm 4.3$ & $\pm 3.6$ & $\pm 2.4$ & $\pm 2.3$ & $\pm 2.3$ & $\pm 2.2$ & $\pm 5.0$ \\\addlinespace[2pt]
real-B & $19.4$ & $18.7$ & $3.9$ & $5.8$ & $28.7$ & $24.6$ & $17.0$ & $15.2$ & $27.8$ & $29.3$ \\
  & $\pm 3.3$ & $\pm 2.8$ & $\pm 0.5$ & $\pm 2.2$ & $\pm 6.6$ & $\pm 6.1$ & $\pm 3.3$ & $\pm 2.5$ & $\pm 5.6$ & $\pm 4.7$ \\\addlinespace[2pt]
syn-B & $19.9$ & $19.0$ & $4.4$ & $5.9$ & $28.5$ & $26.7$ & $17.4$ & $14.4$ & $29.2$ & $29.2$ \\
  & $\pm 3.1$ & $\pm 3.0$ & $\pm 1.1$ & $\pm 2.5$ & $\pm 7.6$ & $\pm 8.1$ & $\pm 3.8$ & $\pm 2.5$ & $\pm 5.1$ & $\pm 4.2$ \\\addlinespace[2pt]
syn-real-B & $15.1$ & $16.2$ & $4.1$ & $6.1$ & $22.8$ & $21.2$ & $13.5$ & $13.3$ & $20.2$ & $24.4$ \\
  & $\pm 2.1$ & $\pm 2.4$ & $\pm 0.8$ & $\pm 2.8$ & $\pm 5.8$ & $\pm 3.3$ & $\pm 1.7$ & $\pm 2.4$ & $\pm 5.0$ & $\pm 6.8$ \\\addlinespace[2pt]
real-all & $13.4$ & $16.0$ & $2.9$ & $5.4$ & $18.8$ & $20.9$ & $13.4$ & $12.3$ & $18.5$ & $25.5$ \\
  & $\pm 0.5$ & $\pm 1.0$ & $\pm 0.4$ & $\pm 2.1$ & $\pm 3.5$ & $\pm 3.2$ & $\pm 2.0$ & $\pm 2.6$ & $\pm 3.0$ & $\pm 3.1$ \\\addlinespace[2pt]
syn-all & $14.8$ & $16.1$ & $3.9$ & $5.5$ & $20.9$ & $20.8$ & $13.7$ & $12.4$ & $20.6$ & $25.5$ \\
  & $\pm 1.1$ & $\pm 2.7$ & $\pm 1.1$ & $\pm 2.4$ & $\pm 3.7$ & $\pm 3.4$ & $\pm 2.0$ & $\pm 2.2$ & $\pm 4.3$ & $\pm 7.6$ \\
  \bottomrule
    \end{NiceTabular}
    \caption{Per-organ metrics for the \textbf{Cervix} data. Each row corresponds to a U-Net trained in the specified setting (see~\sectionref{exp:setup}) and evaluated on sites A and B. Mean $\pm$ st. dev. are reported for 5 folds.}
    \label{tab:cervix_per_organ}

\end{table}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.8\linewidth]{_figures/7_appendix/midl_cmp_slices.png}
    \caption{Comparison of \textit{real-B} to \textit{syn-real-B} using slices from 3 random patients. Depicted are bladder (blue), bowel (red), rectum (green), sigmoid (yellow).}
    \label{fig:cmp_slices}
\end{figure}

\clearpage

\section{Visually checking memorization of synthetic images} \label{app:viz_cmp_syn_real}

In Figure \ref{fig:syn-real-cmp}, we provide examples of synthetic images that were too similar to real images and therefore discarded. Visually, the synthetic images do not appear to be exact copies of real images.

\begin{figure}[h]
    \centering
    \includegraphics[width=0.7\textwidth]{_figures/7_appendix/lung0.png}
    \includegraphics[width=0.7\textwidth]{_figures/7_appendix/10.png}
    \includegraphics[width=0.7\textwidth]{_figures/7_appendix/12.png}
    \caption{Examples of discarded synthetic images and three nearest-neighbor real images for each (annotated with distances to the synthetic image).}
    \label{fig:syn-real-cmp}
\end{figure}

\section{Federated learning baselines} \label{app:fedlearn}

To compare our approach to federated learning, we run two federated learning baselines: Federated Averaging (FedAvg)~\cite{mcmahan_communication-efficient_2017} and Distributed Synthetic Learning (DSL)~\cite{chang_mining_2023}. 

FedAvg trains a segmentation model on the real data at each location and periodically averages the weights to get a global model. We implement FedAvg atop nnU-Net for fair comparison to our approach. DSL is a state-of-the-art approach to training a GAN in a federated manner to be used for generating synthetic data and training a U-Net. Our DSL experiments are based on the official implementation of DSL, see our fork at \url{https://github.com/AwesomeLemon/DSL_All_Code}. 
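
As a sketch of the averaging step, following the formulation in~\cite{mcmahan_communication-efficient_2017}: the notation below is illustrative, and whether the implementation used here weights sites by local dataset size is not specified, so the weighting should be read as an assumption. With $K$ sites, $n_k$ training samples at site $k$, $n = \sum_{k=1}^{K} n_k$, and $w_{t+1}^{k}$ the weights obtained at site $k$ after local training in communication round $t$, the global model is
\[
  w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} \, w_{t+1}^{k} .
\]
With equally sized sites this reduces to a plain mean of the local weights.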

Since the key contribution of our method is its automatic and hyperparameter-free nature, we run the baselines in a similar setting without hyperparameter tuning. For FedAvg, we used 1000 communication rounds with 1 epoch of local training in between, since this was the best setting in~\cite{mcmahan_communication-efficient_2017}. For DSL, we followed the paper~\cite{chang_mining_2023} and the official code base. For hyperparameters that differed across the three datasets in DSL, we used the median values ($\lambda_{\mathrm{L1}}$: 300, 150, 100 $\rightarrow$ 150; batch size: 6, 6, 3 $\rightarrow$ 6).

The results of the experiments are reported in Tables~\ref{tab:fl_cervix}, \ref{tab:fl_lung}, \ref{tab:fl_polyp}. On average, FedAvg achieves a Dice Score (DS) that is 5.3 lower than that of HyFree-S3 in the i.i.d.\ setting of \textbf{Lung}, with the gap widening in the non-i.i.d.\ settings: 10.7 DS for \textbf{Cervix} and 29.2 DS for \textbf{Polyp}. DSL performs substantially worse in all settings (on average: \textbf{Lung}: $32.8$ DS worse, \textbf{Cervix}: $35.1$ DS worse, \textbf{Polyp}: $36.9$ DS worse). The poor performance of DSL was unexpected, given the excellent results in~\cite{chang_mining_2023}. We attribute this to untuned hyperparameters. We have checked that DSL learns to generate relatively realistic images (see Figure~\ref{fig:cmp_real_hyfrees3_dsl}), on which the U-Net that it trains performs well; however, this U-Net generalizes poorly to real test images.
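
For illustration, the \textbf{Lung} gap can be reproduced from Table~\ref{tab:fl_lung}, assuming that the HyFree-S3 average is taken over both \textit{syn-real} models and both test sites, and the FedAvg average over both test sites:
\[
  \overline{\mathrm{DS}}_{\mathrm{HyFree\mbox{-}S3}} = \frac{76.5 + 76.3 + 77.5 + 77.3}{4} = 76.9 ,
\]
\[
  \overline{\mathrm{DS}}_{\mathrm{FedAvg}} = \frac{71.2 + 71.9}{2} = 71.55 \approx 71.6 ,
\]
which gives a gap of $76.9 - 71.6 = 5.3$ DS. The gaps for the other datasets and for DSL follow the same pattern.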

We additionally performed minimal manual hyperparameter tuning of DSL in the \textbf{Polyp} setting and were able to improve its performance by 9.2 DS on average (``DSL (lightly tuned)'' in Table~\ref{tab:fl_polyp}). We only changed the hyperparameters of the U-Net: we switched the optimizer to AdamW with default hyperparameters, replaced the step-wise learning rate schedule with cosine annealing, and added random color jitter and random rotation augmentations.

While the performance of FedAvg and DSL could potentially be improved further via hyperparameter tuning, their default performance is subpar. This highlights the key benefit of HyFree-S3: it is automatic and hyperparameter-free.

\begin{table}[h!]
    \fontsize{8pt}{8pt}\selectfont
    \centering
    \caption{Comparison to the federated learning baselines for the \textbf{Cervix} data. The results of HyFree-S3 for sites A/B correspond to syn-real A/B (see~\sectionref{exp:setup}). Mean $\pm$ st. dev. are reported for 5 folds.}
    \begin{NiceTabular}{@{}l|@{}c@{\hspace{2pt}}|cc|cc@{}} \toprule
    \Block{2-1}{Metric} & \Block{2-1}{\hspace{2pt}Test\\\hspace{1pt}site} & \Block{1-4}{Method} \\
    \cmidrule{3-6}
    & & syn-real A & syn-real B & FedAvg & DSL \\\midrule
    \Block{2-1}{DS} & A & $80.2 \pm 1.7$  & $78.5 \pm 2.1$ & $65.0 \pm 1.9$ & $42.8 \pm 10.4$\\
    & B & $74.1 \pm 1.8$  & $75.9 \pm 1.6$ & $67.7 \pm 2.3$ & $41.2 \pm 0.6$\\
    \bottomrule
    
    \end{NiceTabular}
    \label{tab:fl_cervix} % hyfree-s3 avg: 77.1, fedavg avg: 66.4, dsl avg: 42.0
\end{table}

\begin{table}[h!]
    \fontsize{8pt}{8pt}\selectfont
    \centering
    \caption{Comparison to the federated learning baselines for the \textbf{Lung} data. The results of HyFree-S3 for sites A/B correspond to syn-real A/B (see~\sectionref{exp:setup}). Mean $\pm$ st. dev. are reported for 5 folds.}
    \begin{NiceTabular}{@{}l|@{}c@{\hspace{2pt}}|cc|cc@{}} \toprule
    \Block{2-1}{Metric} & \Block{2-1}{\hspace{2pt}Test\\\hspace{1pt}site} & \Block{1-4}{Method} \\
    \cmidrule{3-6}
    & & syn-real A & syn-real B & FedAvg & DSL \\\midrule
    \Block{2-1}{DS} & A & $76.5 \pm 1.6$  & $76.3 \pm 1.7$ & $71.2 \pm 2.5$ & $43.8 \pm 13.4$ \\
    & B & $77.5 \pm 1.2$  & $77.3 \pm 1.4$ & $71.9 \pm 1.2$ & $44.3 \pm 15.4$\\
    \bottomrule
    
    \end{NiceTabular}
    \label{tab:fl_lung} % hyfree-s3 avg: 76.9, fedavg avg: 71.6, dsl avg: 44.1
\end{table}

\begin{table}[h!]
    \fontsize{8pt}{8pt}\selectfont
    \centering
    \caption{Comparison to the federated learning baselines for the \textbf{Polyp} data. The results of HyFree-S3 for sites A/B correspond to syn-real A/B (see~\sectionref{exp:setup}). Mean $\pm$ st. dev. are reported for 5 folds.}
    \begin{NiceTabular}{@{}l|@{}c@{\hspace{2pt}}|cc|ccc@{}} \toprule
    \Block{2-1}{Metric} & \Block{2-1}{\hspace{2pt}Test\\\hspace{1pt}site} & \Block{1-5}{Method} \\
    \cmidrule{3-7}
    & & syn-real A & syn-real B & FedAvg & DSL & DSL (lightly tuned) \\\midrule
    \Block{2-1}{DS} & A & $90.3 \pm 1.4$  & $82.7 \pm 3.4$ & $54.3 \pm 3.8$ & $51.8 \pm 1.1$ &  $60.7 \pm 2.7$ \\
                    & B & $81.1 \pm 7.4$  & $82.2 \pm 3.9$ & $55.2 \pm 10.1$ & $42.3 \pm 12.1$ &  $51.9 \pm 7.4$\\
    \bottomrule
    
    \end{NiceTabular}
    \label{tab:fl_polyp} % hyfree-s3 avg: 84.0, fedavg avg: 54.8, dsl avg 47.1, dsl' avg 56.3
\end{table}

\begin{figure}[htbp]
\floatconts
  {fig:cmp_real_hyfrees3_dsl} % Label for the whole figure
  {\caption{Random sample of \textbf{Lung} images (Site A, fold 0)}} % Caption for the whole figure
  { % Start of the content of the figure
    \subfigure[Real]{%
      \includegraphics[width=0.29\linewidth]{_figures/7_appendix/example_grid_real.png}
      \label{fig:sub1}
    }\hspace{5mm}
    \subfigure[HyFree-S3]{%
      \includegraphics[width=0.29\linewidth]{_figures/7_appendix/example_grid_hyfrees3.png}
      \label{fig:sub2}
    }\hspace{5mm}
    \subfigure[DSL]{%
      \includegraphics[width=0.29\linewidth]{_figures/7_appendix/example_grid_dsl.png}
      \label{fig:sub3}
    }\vspace{-10pt}
  }
\end{figure}

\section{Further discussion of memorization} \label{app:memo}

We have been communicating with a data protection officer about the prospects of synthetic data since this project started. Based on preliminary discussions, our method could greatly simplify data sharing between institutes and decrease privacy risks. However, there are currently no official, internationally accepted guidelines as to what constitutes memorization (of imaging data).

Therefore, it is difficult to say whether our approach is ``clinically acceptable'', given the lack of prior work on what that entails. While we are confident in our analysis, it could be made more robust by using several dissimilar embedding models, or by increasing the threshold. Still, our method relies on the empirical performance of neural networks (as noted in Section~\ref{met:mem}), and therefore no mathematical guarantees can be given that no memorization occurs. Nonetheless, we believe that an empirical evaluation similar to ours could be enough for clinical acceptance once the guidelines are determined.

To further investigate the memorization phenomenon, we use an alternative method for finding memorized images from a concurrent work~\cite{dar_unconditional_2024}, where it was successfully used to find medical images memorized by a diffusion model. The method is similar to the approach we used in that it relies on embeddings of images via a neural net, with the two differences from our work being the model (the authors trained their own embedding model on the target dataset) and the similarity measure (the authors used correlation).
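
To make the similarity measure concrete, one common choice is the Pearson correlation between embedding vectors. We assume this form purely for illustration, since the exact definition is given in~\cite{dar_unconditional_2024} and is not restated here. For $d$-dimensional embeddings $u$ and $v$ of a synthetic and a real image, with $\bar{u}$ and $\bar{v}$ denoting the means of their components:
\[
  \rho(u, v) = \frac{\sum_{i=1}^{d} (u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{d} (u_i - \bar{u})^2} \, \sqrt{\sum_{i=1}^{d} (v_i - \bar{v})^2}} .
\]
Under this assumption, a synthetic image would be flagged when its highest correlation with any real image exceeds a chosen threshold.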

We apply this approach in the \textbf{Lung} setting. Whereas our method flagged close to zero samples as memorized, this approach flags $\approx 12\%$. However, visual inspection of the synthetic samples closest to some real ones (Figure~\ref{fig:syn-real-cmp-altmemo}) indicates that the synthetic samples are not duplicates of the real ones (based on our judgement). While similar, they differ in their details, and do not satisfy the properties on which the definition of memorization from~\citet{dar_unconditional_2024} is based: they are not variants of the original image derived via rotation, flipping, or contrast adjustment.

Nonetheless, memorization is difficult to define, and the eventual guidelines may enforce stricter definitions of memorization that would include these samples. If so, such synthetic outputs could be removed by using an appropriate embedding model within our method. HyFree-S3 is agnostic to the duplicate-removal technique used within it, so better techniques can be substituted as they are developed.

\begin{figure}[h]
    \centering
    \includegraphics[width=0.7\textwidth]{_figures/7_appendix/visualize_closest5.png}
    \caption{Alternative memorization detection: examples of synthetic images flagged as memorized and three nearest-neighbor real images for each (annotated with similarity to the synthetic image). The synthetic images with the highest similarity to real images are visualized.}
    \label{fig:syn-real-cmp-altmemo}
\end{figure}

\section{Pretraining using single-site synthetic data} \label{app:syn-local}

Does pretraining on multi-site synthetic data have additional benefit over pretraining on single-site synthetic data? Here we compare our default setting (generating $10 \times n_{\mathrm{real}}$ images at each site and pooling them) to the setting of generating $20 \times n_{\mathrm{real}}$ images at one site. In both settings, a U-Net is pretrained on the synthetic images and fine-tuned on the real local data. The experiments are performed for two sites with the \textbf{Cervix} data. Table~\ref{tab:synlocal} shows that the networks pretrained on local synthetic data (\textit{syn-local-real}) achieve on average 2.7 DS worse performance than the networks pretrained on pooled synthetic data (\textit{syn-real}). This result empirically demonstrates the benefit of bringing data from multiple sites together.
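
For illustration, the average gap can be reproduced from Table~\ref{tab:synlocal}, assuming that each average is taken over both models (A, B) and both test sites:
\[
  \overline{\mathrm{DS}}_{\mathrm{syn\mbox{-}real}} = \frac{80.2 + 78.5 + 74.1 + 75.9}{4} = 77.175 ,
\]
\[
  \overline{\mathrm{DS}}_{\mathrm{syn\mbox{-}local\mbox{-}real}} = \frac{80.0 + 72.9 + 71.6 + 73.3}{4} = 74.45 ,
\]
so the difference is $77.175 - 74.45 = 2.725 \approx 2.7$ DS.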

\begin{table}[h!]
    \fontsize{8pt}{8pt}\selectfont
    \centering
    \caption{Comparison of no pretraining (\textit{real}) to pretraining on local synthetic data only (\textit{syn-local-real}) or pooled synthetic data (\textit{syn-real}) in the \textbf{Cervix} setting. Mean $\pm$ st. dev. are reported for 5 folds.}
    \begin{NiceTabular}{@{}l|@{}c@{\hspace{2pt}}|cc|cc|cc@{}} \toprule
    \Block{2-1}{Metric} & \Block{2-1}{\hspace{2pt}Test\\\hspace{1pt}site} & \Block{1-6}{Training setting} \\
    \cmidrule{3-8}
    & & real A & real B & syn-local-real A & syn-local-real B & syn-real A & syn-real B \\\midrule
    \Block{2-1}{DS} & A & $80.1 \pm 1.9$ & $72.2 \pm 3.9$ & $80.0 \pm 1.8$ & $72.9 \pm 3.2$& $80.2 \pm 1.7$  & $78.5 \pm 2.1$\\
                    & B & $70.8 \pm 2.5$ & $73.7 \pm 4.4$ & $71.6 \pm 1.7$ & $73.3 \pm 3.8$ & $74.1 \pm 1.8$  & $75.9 \pm 1.6$\\
    \bottomrule
    
    \end{NiceTabular}
    \label{tab:synlocal}
\end{table}

\section{Comparison to the standard StyleGAN2 architecture} \label{app:square-gan}

To experimentally confirm that transferring the nnU-Net architectural parameters to StyleGAN2 is reasonable, we compare our GANs to the standard StyleGAN2 trained on square images whose resolution is a power of two. The experiment is performed with the Polyp-B data, for which the resolution determined by nnU-Net ($388\times320$) is the farthest from a square with power-of-two side lengths (note that for such square images our StyleGAN2 architecture would be exactly the same as the standard one). The images are resized to the closest power-of-two resolution ($256\times256$), and the standard StyleGAN2 is trained. Synthetic images are then generated and resized to $388\times320$, and the Fr\'echet Inception Distance (FID)~\cite{heusel_gans_2017} is calculated between them and the real images. FID is an established metric of GAN quality; lower values are better.
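
For reference, FID compares Gaussian fits to the Inception features of real and generated images~\cite{heusel_gans_2017}. Writing $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ for the feature means and covariances of the real and generated sets (notation introduced here for illustration):
\[
  \mathrm{FID} = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right) .
\]
The value is zero when the two Gaussians coincide and grows as they diverge.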

We compare the FID of StyleGAN2-$256\times256$ to that of our StyleGAN2-$388\times320$ and find that the latter achieves better results ($88.0 \pm 3.2$ FID for ours vs.\ $102.4 \pm 4.0$ FID for the standard architecture). This confirms that our architecture is reasonable, as it was able to make use of the increased resolution of its inputs to achieve higher generation quality.