\vspace{-6pt}
\section{Method}
\vspace{-4pt}

\subsection{Overview of the method} \label{met:overview}
\vspace{-2pt}

We assume that $N$ sites (such as medical centers) share the goal of solving a segmentation task. Each site has a local dataset that cannot be shared directly due to privacy or security concerns. The datasets may differ across sites in size and image characteristics.

\figureref{fig:method} shows the flow of data and models in HyFree-S3. First, each site runs hyperparameter-free methods to train a generative model and a segmentation model. The generative model (\sectionref{met:syn}) creates images without segmentations, which are then segmented by the segmentation model (\sectionref{met:nnunet}) to form a complete synthetic dataset for sharing. The reasoning behind using two separate models is given in \sectionref{met:sep-gen-seg}.

Next, the synthetic datasets from all sites are merged in a central location, and a general segmentation model is trained on all the synthetic data.
This general segmentation model is then transferred back to the sites and automatically fine-tuned at each of them. The resulting models benefit from the general pretraining but are specialized for each site.

\subsection{Hyperparameter-free medical image segmentation} \label{met:nnunet}

nnU-Net is a robust medical image segmentation method that has been shown to perform excellently in a wide variety of competitions and benchmarks~\cite{isensee_automated_2021}. It adapts to diverse datasets without hyperparameter tuning thanks to heuristics that adjust the underlying U-Net architecture~\cite{ronneberger_u-net_2015} and the training procedure. The spatial dimensions of the data influence the input size of the model, the depths of the encoder and decoder, the downsampling and upsampling strides, and the convolution kernel sizes.

To adapt nnU-Net to the fine-tuning setting, which was not considered by \citet{isensee_automated_2021}, we add a linear learning-rate warm-up over the first 10\% of training epochs~\cite{mosbach_stability_2020} and leave all other default hyperparameters unchanged.
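
One possible way to write this warm-up is as a multiplicative factor on the learning rate. The notation below is illustrative and not used elsewhere in the paper: $T$ is the total number of fine-tuning epochs, $t \in \{1, \dots, T\}$ is the current epoch, $\eta_{\mathrm{default}}(t)$ is the learning rate given by the unchanged nnU-Net schedule, and $T_w = \lceil 0.1\,T \rceil$ is the assumed warm-up length. Whether the factor is updated per epoch or per iteration, and how $T_w$ is rounded, are assumptions of this sketch:
\[
  \eta(t) = \min\left(1, \frac{t}{T_w}\right) \eta_{\mathrm{default}}(t).
\]
For a hypothetical run with $T = 100$, this gives $T_w = 10$, so the factor rises linearly from $0.1$ in the first epoch to $1$ in the tenth epoch and stays at $1$ afterwards.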

\subsection{Hyperparameter-free data synthesis} \label{met:syn}

StyleGAN2~\cite{karras_analyzing_2020} is a powerful generative model that by default generates square images whose side length is a power of two. However, medical images come in a variety of resolutions. Ideally, synthetic images should be generated at the appropriate resolution.

For segmentation tasks, nnU-Net adapts the structure of the U-Net to the resolution. We noticed a similarity between the structures of the generator/discriminator of a GAN and those of the decoder/encoder of a U-Net. Both the GAN generator and the U-Net decoder gradually upscale a low-resolution many-channel latent representation of an image towards a high-resolution few-channel output. The architectures need to strike the correct balance between how quickly the resolution increases and how quickly the number of channels decreases. As such, a network that strikes the correct balance in one setting seems likely to do so in the other. Analogously, the discriminator and the encoder gradually downscale a high-resolution few-channel input towards a low-resolution many-channel representation.

Therefore, it appears natural to reuse the hyperparameters that nnU-Net determines automatically for the encoder and the decoder (depths of the networks, convolution strides, and kernel sizes) for the discriminator and the generator of a GAN. This allows the GAN to generate non-square images whose sizes are not powers of two, matching the exact size determined by nnU-Net.

Next, the number of training steps, $n_{steps}$, needs to be set automatically. In StyleGAN2, $n_{steps}$ is specified in thousands of real images processed during training. As the dataset size increases, so should $n_{steps}$ (to allow the GAN to learn from a larger amount of data). Setting $n_{steps}$ to the number of images in the dataset ensures good image quality across different dataset scales, keeps training times short, and prevents memorization (see \sectionref{exp:mem}).
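
As an illustrative calculation, let $|D|$ denote the number of real training images at a site (a symbol introduced here only for convenience). Since $n_{steps}$ counts thousands of images, the rule $n_{steps} = |D|$ corresponds to
\[
  1000 \cdot n_{steps} = 1000\,|D|
\]
real images processed in total, that is, roughly $1000$ passes over the local dataset regardless of its size (assuming the images are sampled uniformly). For a hypothetical site with $|D| = 500$ images, this gives $n_{steps} = 500$ and $500{,}000$ processed images.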

The number of images to be generated, $n_{gen}$, also needs to be determined. While generating images with GANs requires little compute and time, generating increasingly large numbers of samples leads to diminishing returns~\cite{ravuri_seeing_2019}. We also need to consider the proportions of synthetic data from different sites used to train the general segmentation model: generative models trained on larger datasets should contribute more than those trained on smaller datasets, as they are likely of higher quality. For these reasons, $n_{gen}$ is set to ten times the dataset size for each dataset.
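
To make the resulting proportions explicit, let $|D_i|$ denote the number of real images at site $i$ and $n_{gen}^{(i)} = 10\,|D_i|$ the number of images generated there (both symbols are introduced here for illustration). Ignoring any synthetic images that are discarded later, the share of site $i$ in the merged synthetic dataset is
\[
  \frac{n_{gen}^{(i)}}{\sum_{j=1}^{N} n_{gen}^{(j)}}
  = \frac{10\,|D_i|}{\sum_{j=1}^{N} 10\,|D_j|}
  = \frac{|D_i|}{\sum_{j=1}^{N} |D_j|},
\]
that is, proportional to the size of its real dataset. For instance, two hypothetical sites with $100$ and $300$ real images would contribute $1000$ and $3000$ synthetic images, or $25\%$ and $75\%$ of the merged data.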

We use the augmentation setup of~\citet{zhao_differentiable_2020}, which was shown to improve performance and help avoid overfitting on datasets as small as 100 images.

\subsection{Why not synthesize images and segmentations jointly?} \label{met:sep-gen-seg}
% \vspace{-1pt}
\vspace{-2pt}

Segmenting with a separate model is a deliberate design choice. While it is possible to train a generative model to output segmentations as well as images, this would lead to the segmentations influencing the images. This is undesirable for medical data sharing, as the reference segmentations in many scenarios vary with the annotation protocol and between observers. Keeping these variations from influencing the generated \emph{images} means that the images themselves can still contribute to the improvement of the models (e.g., during unsupervised pretraining) even if the segmentations need to be discarded or redone. Additionally, if no annotations are available, HyFree-S3 can still be used to share images (unlike methods that condition on segmentations).

% \vspace{-3pt}
\vspace{-5pt}
\subsection{Measuring memorization and preventing real data leakage} \label{met:mem}
% \vspace{-1pt}
\vspace{-2pt}

Synthetic data should be similar enough to the real data for models trained on one to transfer to the other, and dissimilar enough for the privacy concerns to be alleviated. Generative models are capable of memorizing their training data and outputting it as ``synthetic'' samples~\cite{feng_when_2021}. Memorization is difficult to detect because a reproduction typically includes some variation or noise. It is not obvious where the threshold between the presence and the absence of memorization should lie.

We propose determining this threshold automatically from the real data itself, as follows. First, the patients are randomly split into two subsets. For each image in the first subset, its dissimilarity to each image in the second subset is computed as the L2 distance between their OpenCLIP embeddings~\cite{ilharco_openclip_2021}. The minimum dissimilarity (i.e., the distance to the nearest neighbor) is stored for each image. The threshold is then defined as the $p$-th percentile of these dissimilarities.

After the synthetic dataset is generated, the dissimilarity of each synthetic image to every real image is computed in the same way. If the distance from a synthetic image to its nearest real-image neighbor is below the threshold, the image is declared memorized and is discarded. For $p = 0$ (the minimum of the dissimilarities), this procedure ensures that no retained synthetic image is more similar to a real image than one real image is to an unrelated one; we set $p = 5$ to guard against outliers. The procedure relies on the quality of the embedding model, which can only be demonstrated empirically~\cite{cherti_reproducible_2023}.
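
The procedure can be summarized compactly as follows; all symbols are introduced here for illustration only. Let $f$ denote the embedding model, $R_1$ and $R_2$ the two patient-level subsets of the real data, $R = R_1 \cup R_2$ (assuming that ``all real images'' refers to both subsets), and $d(x, y) = \| f(x) - f(y) \|_2$. The nearest-neighbor distances within the real data and the resulting threshold are
\[
  m(x) = \min_{y \in R_2} d(x, y) \quad (x \in R_1),
  \qquad
  \tau_p = \mbox{$p$-th percentile of } \{\, m(x) : x \in R_1 \,\},
\]
and a synthetic image $\tilde{x}$ is kept only if
\[
  \min_{y \in R} d(\tilde{x}, y) \ge \tau_p .
\]
In this notation, $p = 0$ gives $\tau_0 = \min_{x \in R_1} m(x)$, the smallest distance observed between images of patients in different subsets.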