\input{figures/method}
\section{Method}
\label{sec:method}
\subsection{Exchanging Text Encoders in Stable Diffusion}
\label{subsec:exchange}

Our approach is based on fine-tuning Stable Diffusion (SD)~\cite{Rombach_2022_CVPR}, a popular pre-trained text-to-image LDM.
We only train the underlying U-Net of SD while keeping the text encoder frozen, since~\citet{Dombrowski_2024}
demonstrated that maintaining the original configuration of the text encoder yields superior phrase grounding results.
%The corresponding Variational Auto Encoder is also kept fixed during training, which allows us to precompute the latent representations beforehand,
%saving both resources and compute time during training.

As a baseline, we first fine-tune the U-Net on our training dataset using the text encodings of the original, frozen pre-trained text encoder
(CLIP-ViT-L/14) of SD version 1.5~\cite{Rombach_2022_CVPR}.
We then compare this baseline against additional training runs in which the original text encoder is replaced by a frozen CXR-BERT~\cite{Boecking_2022_MS_CXR} encoder.
CXR-BERT is a multimodal language model pre-trained on the CXR domain.
Because it is pre-trained on both text and image inputs from this specific domain, CXR-BERT can provide
better text representations than the standard text encoder of SD, which is trained on a largely domain-agnostic
subset of LAION-5B~\cite{Schuhmann_2022}.

Additionally, phrase grounding is inherently a task in which the image and text modalities need to be properly aligned.
Therefore, language models that received both visual and textual learning signals are especially suited for phrase grounding tasks.
This intuition is supported by previous work, which demonstrated that domain-specific text encoders with
multimodal pre-training perform well on phrase grounding tasks~\cite{Boecking_2022_MS_CXR}.
Still, some research, such as the work of~\citet{Bluethgen_2024}, suggests that using a domain-specific text encoder for LDMs does not yield any benefits in the CXR domain.
However, another interesting property of CXR-BERT~\cite{Boecking_2022_MS_CXR} is its use of both a global and a local loss.
Typically, pre-training methods use some variant of a global loss, which is computed at the image and phrase level.
%But intuitively, it seems sensible to include a local loss term as well, which focuses on the alignment of words and
%image regions.
In contrast, a local loss that aligns words with image regions better reflects the bottom-up structure of phrase grounding, since each individual token is associated with a region in the image.
%If this association is accurate, a fitting text-to-image
%mapping emerges.

\subsection{Cross-Attention Map Extraction}
\label{subsec:extraction}
An overview of the cross-attention map extraction process can be seen on the left side of~\figureref{fig:method}.
The text input for CXR-BERT is first tokenized by the corresponding tokenizer with maximum
token length $N_{\max}$ into tokens $\tau_1, \dots, \tau_{N_{\max}}$.
Since only words with lexical meaning can be mapped to image regions, as demonstrated in~\figureref{fig:tokens}, we remove tokens corresponding to function words
using ScispaCy~\cite{neumann-etal-2019-scispacy}.
This approach yields a small improvement over the token processing method used by \citet{Dombrowski_2024} (see Appx.~\ref{sec:processing}).
%The text encoder then computes a learned representation %for each token, resulting in a tensor of dimension
%$(T_{\max} \times D)$ for the hidden dimension $D$ of the encoder. \\
From here, we follow the approach of~\citet{Dombrowski_2024}, meaning we are mostly interested in the probability matrix $P$, defined as
\begin{equation}
    \label{eq:attention}
     P = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right),
\end{equation}
where $Q$ is the query, $K$ is the key, and $d_k$ is the dimension of the attention embedding.
In particular, for batch size $B$, number of layers $L$, image height $H$, and image width $W$, $K$ is a learned linear projection of the text embeddings with dimension $(B \cdot L \times N_{\max} \times d_k)$, while $Q$ is a learned linear projection of the image embeddings with dimension $(B \cdot L \times H \cdot W \times d_k)$.
The matrix $P$, obtained by applying the softmax to their scaled inner product, is the basis for the cross-attention masks and is of dimension $(B \cdot L \times H \cdot W \times N_{\max})$.
Consequently, for each sample in the batch and for each layer, $P$ relates each pixel to each token embedding, making it suitable for evaluating visual grounding.
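To make the normalization in Equation~\ref{eq:attention} explicit, a single entry of $P$ for one sample and layer can be sketched as follows. The symbols $q_i$ and $k_j$ (the $i$-th row of $Q$ and the $j$-th row of $K$) are introduced here for illustration only, and we assume the standard convention that the softmax normalizes over the token axis:
\[
    P_{ij} = \frac{\exp\left(q_i^T k_j / \sqrt{d_k}\right)}{\sum_{j'=1}^{N_{\max}} \exp\left(q_i^T k_{j'} / \sqrt{d_k}\right)}, \qquad i = 1, \dots, H \cdot W, \quad j = 1, \dots, N_{\max}.
\]
Under this assumption, each row of $P$ sums to one, so every pixel distributes a unit of attention over the tokens.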
This matrix is generated and saved for each timestep during inference.
Reshaping $P$ and upsampling the spatial dimensions to our latent image size of $64 \times 64$ therefore yields a tensor of dimension
$(B \times T \times L \times N_{\max} \times 64 \times 64)$ for the number of timesteps $T$.
This allows us to easily select specific layers, timesteps and tokens.
The 2D activation maps $P_\text{comb}$ can then be obtained for each item in the batch by excluding the start and end tokens and
averaging over the timestep, layer, and token dimensions.
When we use lexical filtering, only the relevant tokens enter this average.
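As a sketch of this averaging step, let $\mathcal{S} \subseteq \{1, \dots, N_{\max}\}$ denote the indices of the retained tokens, i.e., all tokens except the start and end tokens and, if lexical filtering is used, the function words. The set $\mathcal{S}$ and the index notation are introduced here for illustration only. For a batch item $b$, the combined map can then be written as
\[
    P_\text{comb}^{(b)} = \frac{1}{T \, L \, |\mathcal{S}|} \sum_{t=1}^{T} \sum_{l=1}^{L} \sum_{n \in \mathcal{S}} P^{(b)}_{t,l,n},
\]
where each $P^{(b)}_{t,l,n}$ is a $64 \times 64$ map.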
An intuition for this approach is provided by~\figureref{fig:tokens}.
Other text encoders tend to focus on unnecessary details, such as the ribcage, in the attention maps at higher resolutions. In contrast, CXR-BERT produces strong attention maps across all resolutions of the U-Net, which is why we can average over the layers (see Appx.~\ref{sec:layers} for an example).
Due to this consistency of CXR-BERT, we also do not need to select specific timesteps.
To obtain accurate localization capabilities from these maps, the model needs both an image input and a textual input.
Therefore, instead of starting with Gaussian noise during sampling, the ground-truth image is used as input in each
timestep.
In each step, the noise appropriate for the current timestep is added to a fresh copy of the input image.
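A minimal sketch of this noising step, assuming the standard DDPM forward process (the noise schedule actually used is not specified here), is given below. The symbols $z_0$ (latent of the ground-truth image), $\bar{\alpha}_t$ (cumulative product of the schedule coefficients), and $\epsilon_t$ (a fresh noise sample) are notation introduced for illustration:
\[
    z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).
\]
Because $z_t$ is always computed from the clean latent $z_0$ rather than from the previous denoising output, the input stays tied to the ground-truth image at every timestep.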
The corresponding binary mask for computing mIoU is obtained by fitting a Gaussian Mixture Model to the activation map.

\subsection{Bimodal Bias Merging}
\label{subsec:bias}
To further improve the accuracy of the 2D activation maps $P_\text{comb}$, we incorporate more information via a process we call Bimodal Bias Merging (BBM).
An overview of this method can be seen on the right side of~\figureref{fig:method}.
In this process, we combine the activation maps from~\sectionref{subsec:extraction} with the textual bias and image bias of the model, as motivated by the results of~\sectionref{sec:phrase-grounding-benchmarks}.
To this end, we only extract the cross-attention values of $P$ that correspond to the start token, which
represents the image bias of the model.
This way, we end up with a tensor $P_\text{img}$ of dimension $(B \times T \times L \times 1 \times 64 \times 64)$.
For the textual bias of the model, we need to sample the LDM again, but with the usual Gaussian noise as image input this time.
This results in a tensor $P_\text{txt}$ of dimension $(B \times T \times L \times N_{\max} \times 64 \times 64)$.
By combining $P_\text{txt}$ and $P_\text{img}$ via matrix multiplication (denoted as $\otimes$), we capture cross-modal interactions between the two representations.
Empirically, $P_\text{mult} = P_\text{img} \otimes P_\text{txt}$ consists of large radial gradients that show the most likely locations of the disease.
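One way to read this product, assuming that $\otimes$ acts on the two trailing spatial dimensions and that the singleton token axis of $P_\text{img}$ is broadcast over the $N_{\max}$ tokens of $P_\text{txt}$ (an assumption, since the indexing is not spelled out above), is
\[
    \left(P_\text{mult}\right)_{b,t,l,n} = \left(P_\text{img}\right)_{b,t,l,1} \left(P_\text{txt}\right)_{b,t,l,n},
\]
where each factor is a $64 \times 64$ matrix, so that $P_\text{mult}$ has the same dimension $(B \times T \times L \times N_{\max} \times 64 \times 64)$ as $P_\text{txt}$.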
To measure the accuracy of this map, we compute the structural similarity index measure (SSIM)~\cite{ssim_wang} between the
maps for the text bias and the image bias and clip it to the range $[0,1]$, which yields the score $s$.
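For reference, the standard SSIM between two maps $x$ and $y$, with means $\mu_x, \mu_y$, variances $\sigma_x^2, \sigma_y^2$, covariance $\sigma_{xy}$, and small stabilizing constants $c_1, c_2$, is
\[
    \text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},
\]
which lies in $[-1, 1]$ and is, in its original formulation, evaluated over local windows and averaged. The clipping can then be sketched as $s = \max\left(0, \min\left(1, \text{SSIM}(P_\text{img}, P_\text{txt})\right)\right)$. How the two tensors are reduced to 2D maps before the comparison is not specified here, so the arguments should be read as the corresponding bias maps.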
Finally, to obtain our new activation map $P_\text{BBM}$, the bias interaction map and the original activation map are interpolated via the following quadratic Bézier curve:
\begin{equation}
    \label{eq:lbm}
    P_\text{BBM} = 2(1-s)s\left(\frac{P_\text{mult} + P_{\text{comb}} + P_\text{mult} \odot P_{\text{comb}}}{2}\right)
    + (1-s)^2 P_{\text{comb}}
    + s^2 P_\text{mult},
\end{equation}
with $\odot$ being the Hadamard product.
This interpolation is essentially linear, except for the control point receiving additional information regarding the multiplicative interaction between the biases.
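To see why, the following worked expansion of Equation~\ref{eq:lbm} splits the curve into a linear part and a gating term, with the control point $C$ introduced as shorthand. Writing $C = \frac{1}{2}\left(P_\text{mult} + P_\text{comb} + P_\text{mult} \odot P_\text{comb}\right)$ and using $(1-s)^2 + (1-s)s = 1-s$ and $s^2 + (1-s)s = s$, we obtain
\[
    P_\text{BBM} = (1-s)^2 P_\text{comb} + 2(1-s)s\,C + s^2 P_\text{mult}
    = (1-s)\,P_\text{comb} + s\,P_\text{mult} + (1-s)s\,P_\text{mult} \odot P_\text{comb}.
\]
The first two terms are the linear interpolation between $P_\text{comb}$ (at $s=0$) and $P_\text{mult}$ (at $s=1$). The last term vanishes at both endpoints and is largest at $s = 1/2$, where it adds $\frac{1}{4} P_\text{mult} \odot P_\text{comb}$.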
In its base form, BBM is the linear interpolation $(1-s)P_{\text{comb}} + sP_{\text{mult}}$, which improves the activations around the location of the disease, as measured by CNR.
However, this typically does not improve the masks derived from the attention maps, as measured by mIoU, since thresholding would include even low activations as part of the masks.
We therefore introduce a control point into the equation that essentially serves as a gating mechanism, constraining the activation areas to adhere to the merged biases.
It is thus primarily relevant when computing masks.
By construction, Equation~\ref{eq:lbm} remains close to a linear interpolation despite the gating mechanism; it considerably improves the activation maps while also giving a slight boost to the masks.
Since the activation maps are primarily supposed to increase interpretability, it should be easy to inspect them with the human eye.
The main benefit of using BBM is that the activations are much clearer to see, which is more meaningful than simply using masks.
For more details on the interpolation, see Appx.~\ref{sec:processing}.

In this way, the original map is combined with the modality interaction map based on the calculated confidence score.
This is based on the heuristic that, if the image bias and text bias are similar, then the fields created by their merging are more likely to support finding an accurate location of the disease.
Meanwhile, if the two biases have large discrepancies, their combined information is less likely to enhance the original activation map and should mostly be ignored.
