\section{Introduction}
\label{sec:introduction}
Phrase grounding refers to the ability of a model to map textual tokens to regions in an image.
Unlike typical object detection or segmentation tasks, phrase grounding usually takes natural language as input,
such as medical reports, instead of relying on a predefined set of categories.
Thus, phrase grounding can be seen as a generalization of object detection.

In the medical domain, phrase grounding can be used to localize anomalies in images based on textual descriptions
provided by experts~\cite{Bhalodia_2021}.
This is attractive because it requires no explicit localization labels, which are rare and expensive for medical data.
By inspecting phrase grounding performance, it is possible to infer which phrases or regions influenced the decision of the system,
and to assess whether the model made proper use of all available modalities and whether the modalities were aligned properly without
confusion~\cite{Parcalabescu_2020_phrase_grounding}.
These properties are vital for interpretability, which is a necessary requirement for models to be used in the medical field~\cite{Chen_2023}.
In addition, models that lack interpretability could introduce harmful biases that go unexplained, which is
especially critical for high-risk decision-making~\cite{HAKKOUM2022_interpretability_medical}.

Existing discriminative medical phrase grounding approaches can be roughly categorized into two groups:
supervised methods and self-supervised methods based on contrastive
learning~\cite{Boecking_2022_MS_CXR,Gupta_2020,zhang_2022}.
An important supervised approach is MedRPG~\cite{Chen_2023}, which uses ground-truth bounding boxes on radiographs to formulate a contrastive loss. This loss is based on the features of the bounding boxes, as well as the joint attention of the bounding boxes, the class token, and an additional learnable token.
Another relevant branch of medical phrase grounding comprises methods that work with 3D medical data, such as the work of~\citet{Ichinose_2023}, which addresses the unique issues of phrase grounding in CT scans.
They suggest using a pre-trained segmentation model that labels the anatomical structures visible in the scan, and they introduce a module to structure the corresponding medical reports.
However, such supervised methods require ground-truth bounding boxes or annotators, which are difficult to obtain, especially in the medical domain.
Self-supervised methods do not require explicit labels but do not always lead to the desired result.
%The other option is to use self-supervised approaches via contrastive
%learning~\cite{Boecking_2022_MS_CXR,Gupta_2020,zhang_2022}.
Discriminative methods usually evaluate their phrase grounding performance by computing the cosine similarity between the
text embeddings and the corresponding image embeddings.
However, it has recently been shown that phrase grounding tasks can also be solved using generative models in an unsupervised context~\cite{Dombrowski_2024}.
Specifically, text-to-image Latent Diffusion Models (LDMs) are well suited to this task, owing to their use of cross-attention to
combine the two modalities, as well as their ability to produce high-quality images.
Text-to-image LDMs are trained to generate images from a dataset while receiving additional text conditioning from the
corresponding text inputs~\cite{Dombrowski_2023_ICCV,vilouras2024zeroshotmedicalphrasegrounding}.
Rather than through cosine similarity, the phrase grounding capabilities of LDMs are more readily evaluated through their cross-attention layers.
In addition, \citet{Dombrowski_2024} showed that using a frozen text encoder improves the phrase
grounding capabilities of an LDM.
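
As a minimal sketch of the two evaluation routes, consider the following generic notation, which is an assumption made for illustration and not necessarily the exact formulation of the cited works.
Let $\mathbf{v}_{ij}$ denote the $d$-dimensional image embedding at spatial location $(i,j)$ and $\mathbf{t}$ the $d$-dimensional embedding of a phrase.
A discriminative model would typically score each location by the cosine similarity
\begin{equation}
    s_{ij} = \frac{\mathbf{v}_{ij}^{\top}\mathbf{t}}{\|\mathbf{v}_{ij}\|_{2}\,\|\mathbf{t}\|_{2}},
\end{equation}
whereas a standard scaled dot-product cross-attention layer, with queries $Q$ computed from image features and keys $K$ of dimension $d_k$ computed from text tokens, yields the attention map
\begin{equation}
    A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right),
\end{equation}
where the softmax is taken over the text tokens.
The column of $A$ belonging to a given token, reshaped to the spatial resolution of the layer, can then be read as a localization heatmap for that token.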

So far, the self-supervised approach by~\citet{Boecking_2022_MS_CXR} has achieved the highest phrase grounding performance on Chest X-ray (CXR) data
by taking a Large Language Model (LLM) pre-trained on the biomedical domain and fine-tuning it on CXR reports.
The resulting LLM is known as CXR-BERT.
This model is jointly trained with an image encoder, in a framework called BioViL.
In this work, we build on these findings and use CXR-BERT as a frozen text encoder that conditions
the U-Net in an LDM.
Concretely, we inject the embeddings of CXR reports learned by CXR-BERT,
while fine-tuning the U-Net on the corresponding CXR images.
CXR-BERT and the LDM support each other in a bidirectional manner.
On the one hand, the LDM, having a generative architecture, can exploit the phrase grounding potential of the text embeddings more fully than the simple CNN used in BioViL.
On the other hand, the powerful text embeddings learned by CXR-BERT provide the conditioning that enables the LDM to learn a well-grounded multimodal representation.

In summary, the contributions of our work are as follows:
\begin{itemize}
%\setlength{\itemsep}{0pt}
\parskip0pt
    \item We demonstrate that a multimodal text encoder with domain-specific knowledge can vastly improve phrase grounding capabilities of an LDM.
    \item We show that generative approaches can yield far better phrase grounding results than traditional discriminative approaches, nearly doubling conventional performance metrics such as mIoU.
    \item We discuss a novel post-processing method that can boost the performance of phrase grounding frameworks.
\end{itemize}