
\noindent\textbf{Limitations:} %There are two main limitations of our work: First, we did not train our model for different seeds, since the resulting resource requirements would be beyond our means. 
%Second, our post-processing method BBM relies on the model already having learned powerful representations for phrase grounding and the model having a LDM architecture. So it probably will only work for methods closely related to ours.
First, while our model showed considerable improvements for the other diseases, it performed less well on Pneumothorax, whose shape and location in the body are far less consistent.
Due to these characteristics, Pneumothorax is challenging to localize for both our method and the compared methods.
To learn to localize Pneumothorax properly, one option would be to fine-tune the model directly on a curated subset of the data; alternatively, one could take inspiration from other Pneumothorax detection works such as~\citet{Park_2022} and apply knowledge distillation from a model fine-tuned specifically on Pneumothorax.
However, we kept our fine-tuning process as general as possible to enable a fair comparison between methods.
Second, our BBM technique is designed specifically to complement models that already have strong phrase grounding capabilities and an LDM architecture, so it will probably only work for methods closely related to ours. Nonetheless, the principles underlying our approach could inspire adaptations for other architectures or domains.
A detailed ablation study is provided in Appx.~\ref{sec:ablations}.
\section{Conclusion}
We demonstrated that domain-specific, multimodal text encoders, such as CXR-BERT, significantly enhance phrase grounding performance in LDMs, particularly in the medical imaging domain. By integrating such encoders, our approach nearly doubles key metrics such as mIoU compared to state-of-the-art discriminative methods, establishing generative models as a superior alternative for this task. Additionally, we introduced BBM, which refines cross-attention maps to further improve localization accuracy and robustness.

Our findings highlight the untapped potential of generative models in aligning text and image modalities, providing a pathway toward more interpretable and trustworthy medical AI systems. While our work represents an advancement, it also underscores the importance of balancing interpretability with generative quality for clinical applications. Future research should focus on extending this approach to other medical domains and on exploring strategies to improve this balance further.

%However, we have shown how using a domain-specific and multimodal text encoder can considerably improve the phrase grounding performance of LDMs compared to state-of-the-art methods.
%Additionally, we have developed one way to increase this performance by using a straightforward post-processing scheme.

%Appendix 