\section{Interpretability Trade-off}
\label{sec:tradeoff}
\input{tables/fid}
While the presented phrase grounding performance of our model allows for a higher degree of interpretability than comparable models, this comes at a price.
\citet{Dombrowski_2024} discuss a trade-off between interpretability and performance in LDMs.
They note that models with stronger phrase grounding capabilities often produce lower-quality images, while those with
higher image quality tend to have poorer phrase grounding performance.
\tableref{tab:fid_comparison} shows that this pattern is evident in our experiments as well.
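For reference, the Fr\'echet Inception Distance (FID) is, in its standard formulation, the Fr\'echet distance between two Gaussians fitted to feature embeddings of real and generated images:
\[
  \mathrm{FID} = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right).
\]
Here, $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of the real and generated image features, respectively; this notation is introduced only for illustration, and the feature extractor (Inception-v3 in the standard formulation) is an assumption that may differ from the one underlying the reported scores.
Lower values indicate that the generated image distribution lies closer to the real one, i.e., higher image quality.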
Using a frozen CXR-BERT encoder for text conditioning results in worse (higher) FID scores, but produces the best
phrase grounding performance.
Meanwhile, the frozen CLIP encoder, which has weaker phrase grounding, achieves better FID
scores.
A learnable CLIP encoder provides the highest image quality, but the lowest phrase grounding metrics.
This means that the choice of model for clinical applications must be considered carefully.
Simply choosing the model with the best image generation capabilities might produce visually convincing images.
However, the fidelity of these images cannot be trusted, since the model's internal representations cannot be interpreted.
Additionally, the lack of alignment between the image and text modalities implies that these models have no proper understanding of what they are generating, which might lead to harmful biases and mistakes in the generated images.
Models with better phrase grounding capabilities might be more trustworthy, but the lower quality of their generated images can also be problematic for clinical applications.
Even if a model has a good internal representation of both modalities, subpar or unrealistic images can hardly be used in clinical settings.
Currently, professionals need to choose a fitting balance between image quality and interpretability depending on their use case.