\section{Ablation study}
\label{sec:ablations}
\input{tables/llm_comparison}
\noindent
The ablation studies investigate the impact of different domain-specific text encoders on phrase grounding.
As shown in~\tableref{tab:llm_comparison}, four different sampling methods are compared: using only generic
text conditioning during sampling, applying Classifier-Free Guidance (CFG) during sampling, additionally providing the noisy ground-truth
image in the first timestep (GT-1 + CFG), and additionally providing the noisy ground-truth image in every step (GT + CFG).
The metrics AUC-ROC and Top-1~\cite{Dombrowski_2024} are also included here.
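For reference, CFG is commonly written as a combination of a conditional and an unconditional noise prediction at each sampling step.
The notation below is illustrative and not taken from the evaluated implementation: $\epsilon_\theta$ denotes the denoising network, $x_t$ the noisy sample at timestep $t$, $c$ the text condition, $\emptyset$ the empty (null) condition, and $w$ the guidance scale:
\[
  \tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset) \right).
\]
In this parameterization, $w = 1$ recovers the purely conditional prediction, while $w > 1$ strengthens the influence of the text condition.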
The most important observation from~\tableref{tab:llm_comparison} is that CXR-BERT performs by far the best
of all tested models.
Several attributes distinguish CXR-BERT from the other encoders and could be responsible for this.
Unlike the domain-agnostic CLIP models, CXR-BERT is trained on domain-specific CXR data.
In contrast to RadBERT, CXR-BERT is trained in a multimodal manner.
Compared to Med-KEBERT, CXR-BERT does not rely on any report preprocessing.
In comparison to CXR-CLIP, CXR-BERT has a considerably more complex pretraining procedure.
Additionally, CXR-BERT uses both a local and a global loss, which differentiates it from all other discussed models.
The local loss term, combined with the domain-specific, multimodal training, is most likely the key to the strong
performance of CXR-BERT.