\section{Experiment Setup}
\label{sec:experiment_setup}
\subsection{Training Setup}
\label{subsec:training-&-optimization-setups}
Each run is performed using the same configuration:
we use a base learning rate of $5 \times 10^{-5}$, with 1000 warmup steps and a cosine learning rate scheduler.
Training is distributed over eight 80\,GB A100 GPUs with a per-GPU batch size of 16 and two gradient accumulation steps each,
resulting in an effective batch size of $8 \times 16 \times 2 = 256$.
For reproducibility, we use three uniformly sampled seeds: 4200, 1759, and 6357.
For efficiency, we train with mixed precision.
During training, we apply unconditional guidance training, i.e., the text conditioning is dropped with a probability of 30\%.
Additionally, we keep an exponential moving average (EMA) of the U-Net weights, which is used for sampling.
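
As a sketch of the schedule described above, assume a linear warmup followed by a cosine decay to zero. Neither the warmup shape nor the total number of training steps $T$ is stated here, so both are assumptions made for illustration. Under these assumptions, the learning rate at step $t$ would be
\[
\eta(t) = \left\{
\begin{array}{ll}
\eta_{\max}\,\frac{t}{t_w}, & t < t_w,\\[6pt]
\frac{\eta_{\max}}{2}\left(1 + \cos\left(\pi\,\frac{t - t_w}{T - t_w}\right)\right), & t_w \le t \le T,
\end{array}
\right.
\]
with $\eta_{\max} = 5 \times 10^{-5}$ and $t_w = 1000$.
Likewise, an exponential moving average of the weights $\theta_t$ is commonly maintained as
\[
\bar{\theta}_t = \beta\,\bar{\theta}_{t-1} + (1 - \beta)\,\theta_t ,
\]
where the decay $\beta \in (0, 1)$ is a hypothetical parameter whose value is not reported in this section.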

\subsection{Datasets}
\label{subsec:datasets}
The base dataset for training and testing is the MIMIC-CXR dataset~\cite{Goldberger_2000_PhysioNet,Johnson_2019_MIMIC}. 
It contains pairings of CXR images and their respective reports, which include medical findings such as diseases. 
For training, we use the train split proposed by MIMIC-CXR-JPG~\cite{Goldberger_2000_PhysioNet,Johnson_2024_MIMIC_CXR_JPG}, which consists of 162,651 image-report pairs.
Exploratory hyperparameter optimizations were conducted on the ChestXRay14 dataset~\cite{wang_2017_chestxray}.
%This split includes both posterior-anterior and anterior-posterior views.

Our test set consists of MS-CXR~\cite{Boecking_2022_MS_CXR}, a subset of MIMIC-CXR.
MS-CXR features improved bounding boxes, which can be used to evaluate the phrase-grounding performance of our models. 
Additionally, MS-CXR includes refined report descriptions that yield higher evaluation accuracy.


\subsection{Metrics}
\label{subsec:metrics}
To be comparable with~\citet{Boecking_2022_MS_CXR}, we report the Contrast-to-Noise Ratio (CNR) and the mean Intersection over Union (mIoU) of our results.
CNR is calculated as $\text{CNR} = \frac{|\mu_{A_i} - \mu_{A_e}|}{\sqrt{\sigma^2_{A_i} + \sigma^2_{A_e}}}$, where $\mu_{A_i}$ and $\mu_{A_e}$ denote the means and $\sigma^2_{A_i}$ and $\sigma^2_{A_e}$ the variances of the similarity scores inside ($A_i$) and outside ($A_e$) the bounding box, respectively.
Because CNR is computed directly on the continuous similarity scores, it evaluates phrase-grounding performance without requiring a threshold~\cite{Boecking_2022_MS_CXR}.
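As a worked illustration with purely hypothetical values (not measured results), suppose $\mu_{A_i} = 0.8$, $\mu_{A_e} = 0.2$, $\sigma^2_{A_i} = 0.04$, and $\sigma^2_{A_e} = 0.05$. Then
\[
\text{CNR} = \frac{|0.8 - 0.2|}{\sqrt{0.04 + 0.05}} = \frac{0.6}{0.3} = 2 ,
\]
i.e., the mean similarity inside the box is separated from the mean outside it by two pooled standard deviations.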
Additionally, we compute mIoU as the mean of the Jaccard indices $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $A$ is the region obtained by thresholding the phrase-grounding map and $B$ is the ground-truth bounding box.
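Written out, and assuming the mean is taken over the $N$ evaluated image-phrase pairs (the averaging set is our reading and is not stated explicitly), with $A_n$ the thresholded phrase-grounding region and $B_n$ the ground-truth bounding box of the $n$-th pair, this corresponds to
\[
\text{mIoU} = \frac{1}{N} \sum_{n=1}^{N} \frac{|A_n \cap B_n|}{|A_n \cup B_n|} .
\]
For a single pair with hypothetical pixel counts $|A| = 60$, $|B| = 80$, and $|A \cap B| = 40$, the union is $|A \cup B| = 60 + 80 - 40 = 100$, so $J(A, B) = 40 / 100 = 0.4$.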