
\section{Results \& Discussion}
\input{figures/phrase_grounding_tokens}
\label{sec:phrase-grounding-benchmarks}
\figureref{fig:tokens} shows how the model attends to the individual tokens of a sentence.
For example, the activation map for \textit{``consolidation''} highlights the approximate region of the anomaly.
The activations for \textit{``right''} lie on the right side, those for \textit{``lower''} on the lower part, and so
on.
Meanwhile, tokens with no lexical content, such as \textit{``the''}, have no clear activation patterns.
The activation map of the start token corresponds to the image bias of the model, which already works remarkably well.
This implies that the model has a good internal representation of the diseases, even when considering the text and image modalities separately.
Meanwhile, the end token provides even better phrase grounding capabilities, since it can incorporate the knowledge of the preceding tokens.
Empirically, using the activation map of the end token produces results similar to using the mean over the preceding tokens.
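To make this comparison concrete, the two aggregation strategies can be sketched as follows.
The notation is introduced here only for illustration: $A_t$ denotes the $H \times W$ activation map of the $t$-th of the $T$ tokens preceding the end token, and $A_{\mathrm{end}}$ denotes the activation map of the end token itself.
\[
    M_{\mathrm{end}} = A_{\mathrm{end}}, \qquad M_{\mathrm{mean}} = \frac{1}{T} \sum_{t=1}^{T} A_t .
\]
Any normalization or upsampling applied to the maps before evaluation is omitted from this sketch.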

\input{tables/main_results}

\input{figures/phrase_grounding_examples}

As shown in~\tableref{tab:main_results}, using CXR-BERT for text conditioning in an LDM leads to
better phrase grounding results than using CXR-BERT in its original framework,
BioViL~\cite{Boecking_2022_MS_CXR}.
The cross-attention maps of the LDM yield better results for both metrics (\textit{i.e.}, CNR and mIoU)
reported by~\cite{Boecking_2022_MS_CXR}.
Our approach approximately doubles the mIoU results.
Consequently, our approach seems to be especially suited for mask generation.
Additionally, our setup achieves higher results across all diseases compared to BioViL, and for almost all diseases
compared to its improved version, BioViL-L.
As discussed in Appx.~\ref{sec:tradeoff}, we can also confirm the trade-off between interpretability and image generation quality observed by~\citet{Dombrowski_2024}.
Our post-processing method BBM increases the CNR results considerably.
The improvement in mIoU is smaller, most likely because the thresholding applied during mask generation discards some of the gained information.
Notably, since our model could not learn pneumothorax properly, BBM decreases the phrase grounding performance for this disease.
These low values for pneumothorax align with the findings by~\citet{Dombrowski_2024}, suggesting that this disease
may be particularly challenging to model.
This might be due to pneumothorax being very inconsistent in location and size, or to the corresponding impressions being too short.
%In addition, this is coupled with the impressions often %being rather short when describing the disease.
In either case, the model may not receive enough information to capture such a complex disease.
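
For reference, the two metrics discussed above can be written out as follows, assuming the definitions of~\cite{Boecking_2022_MS_CXR}; the symbols are introduced here only for illustration.
Let $A$ and $\bar{A}$ denote the interior and exterior of the ground-truth bounding box, let $\mu$ and $\sigma^2$ be the mean and variance of the activation map within each region, and let $\hat{M}$ be the thresholded activation map and $G$ the ground-truth box mask:
\[
    \mathrm{CNR} = \frac{|\mu_A - \mu_{\bar{A}}|}{\sqrt{\sigma_A^2 + \sigma_{\bar{A}}^2}}, \qquad
    \mathrm{IoU} = \frac{|\hat{M} \cap G|}{|\hat{M} \cup G|} .
\]
The mIoU is the average of the IoU values.
Because the CNR is computed on the continuous map whereas the IoU requires a binarized mask, a gain in contrast does not necessarily carry over fully to the mIoU.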

Furthermore, our approach improves on the previous method by~\citet{Dombrowski_2024}, which is also based on the extraction of cross-attention maps.
The key difference is that we replace the generic CLIP text encoder with the domain-specific
CXR-BERT text encoder.
This change greatly increases the performance of the model, as shown in~\tableref{tab:main_results}.
Consequently, we show that a domain-specific LLM can greatly improve the phrase grounding capabilities of LDMs.
These results indicate that better phrase grounding can be achieved in a generative setting than in a
discriminative one, even though the generative setting uses no explicit alignment loss.

\figureref{fig:examples} shows examples of cross-attention maps extracted from the LDM conditioned with a
CLIP text encoder and with a CXR-BERT text encoder, as well as cosine similarity maps from BioViL.
As already demonstrated in~\citet{Dombrowski_2024}, employing a frozen CLIP text encoder yields solid phrase
grounding results.
However, it still performs worse than some domain-specific weakly supervised methods such as BioViL.
By employing text conditioning based on an encoder with strong phrase grounding capabilities in that domain,
the strengths of both methods are combined, resulting in the best outcomes.
As shown in the second row of~\figureref{fig:examples}, BBM can sometimes correct inaccurate predictions made by our model, thus increasing its accuracy.
