\section{Hyperparameter Optimization}
\label{sec:timesteps}
\input{figures/timestep_comparison}
All hyperparameter optimization was carried out on the ChestXRay14~\cite{wang_2017_chestxray} dataset.
In addition to CNR and mIoU, we also include the Top-1 metric~\cite{Dombrowski_2024} and the AUC-ROC metric here.
\figureref{fig:timestep_comparison} shows that, when a noisy ground-truth image is used as input during image generation,
phrase grounding performance does not improve significantly beyond about five timesteps.
%However, due to some slight increases in performance, the best performance considering all metrics seems to be at
%around 65 timesteps.
This indicates that our results do not depend heavily on this hyperparameter.
It seems reasonable that the results saturate earlier for this sampling method than for the others, since the model is given
significantly more information in the form of the ground-truth images.
%However, using only text conditioning, the best phrase grounding performance is typically achieved at around 40
%timesteps according to most metrics.
When the noisy ground-truth image is used only in the first denoising step, the results initially match those of our main approach.
This behavior is expected: if only a single timestep is used in total, using the ground truth only in the first step is equivalent to using it in all steps.
As the number of timesteps increases, the performance gradually approaches the results from text-only conditioning.
However, while there is a steep decline after the first few steps if the ground-truth image is only used in the first
sampling step, the performance eventually stabilizes.

\input{figures/timestep_selection}

\figureref{fig:timestep_selection} shows how the selection of the last $n$ timesteps during
ground-truth sampling affects the results.
The figure is restricted to 65 timesteps, since this configuration produced the best results across all four phrase
grounding metrics.
This result can also be regarded as representative, since the observed trends are very similar for all timestep counts in a sensible
range.
As can be seen in~\figureref{fig:timestep_selection}, the metrics do not show the same behavior over the number of selected timesteps.
CNR and AUC-ROC steadily increase the more timesteps are selected.
However, the progression of mIoU is concave, peaking when 45 timesteps are selected and decreasing thereafter.
Meanwhile, Top-1 peaks at 5 timesteps and then tends to decrease as more timesteps are selected.
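
To make the selection explicit, the following is an illustrative formalization rather than the exact implementation; the notation and the unweighted mean are assumptions introduced here.
Let $A_s$ denote the activation map obtained at sampling step $s \in \{1, \dots, T\}$, indexed in sampling order.
Selecting the last $n$ timesteps then corresponds to
\[
  \bar{A}_n = \frac{1}{n} \sum_{s = T - n + 1}^{T} A_s, \qquad 1 \le n \le T,
\]
so that, for $T = 65$, selecting $n = 45$ timesteps would combine steps $21$ to $65$.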

Since AUC-ROC and CNR are closely related metrics, it makes sense that they again show similar behavior.
When incorporating more timesteps, noise introduced in single timesteps becomes less relevant.
Both AUC-ROC and CNR give worse results when more noise is introduced to the signal, which is why these metrics typically yield better values for a larger number of timesteps.
Meanwhile, Top-1 only incorporates the highest activations, which is why this metric is extremely robust to noise.
Therefore, selecting only a small number of timesteps can work well.
Top-1 most likely decreases when selecting a larger number of timesteps, since early timesteps focus on coarse features, resulting in larger activation areas.
Consequently, it is more likely that the highest activation is no longer strictly within the ground-truth bounding box.
Meanwhile, the mIoU metric can tolerate a certain amount of noise, due to the thresholding applied when creating the binary masks.
However, our masking approach generally has a tendency to include too much of the signal as part of the mask.
Thus, both too much noise and overly large activation areas decrease the quality of the generated mask.
Therefore, a compromise between the two, namely selecting roughly half of the timesteps, yields the best mIoU values in this setup.
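
The noise argument can be illustrated under strong simplifying assumptions, which are introduced here purely for exposition and are not part of our method.
Suppose each of the $n$ selected activation maps decomposes as $A_i = S + \varepsilon_i$, with a shared signal $S$ and independent zero-mean noise $\varepsilon_i$ of per-pixel variance $\sigma^2$.
Averaging the maps then gives
\[
  \frac{1}{n} \sum_{i=1}^{n} A_i = S + \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i,
  \qquad
  \mathrm{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \right) = \frac{\sigma^2}{n},
\]
so the noise standard deviation shrinks as $1 / \sqrt{n}$, which is consistent with CNR and AUC-ROC improving as more timesteps are selected.
In practice, the signal is not constant across timesteps, since early timesteps emphasize coarser features; this idealization therefore does not capture the behavior of Top-1 and mIoU.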

When interpreting these results, one should keep in mind that all changes are very small, so the number of selected
timesteps does not play a considerable role as a hyperparameter either.