\begin{figure}[t]
    \centering
    \includegraphics[width=0.9\textwidth]{images/method_figure_reworked}
    \caption{During fine-tuning, radiological text reports are extracted \textbf{(a)}.
    These reports are fed into CXR-BERT with frozen parameters \textbf{(b)}.
    The resulting text embeddings condition the U-Net of the LDM: they are injected into each of its cross-attention layers (shown as gray bars).
    The LDM learns to generate images, taking as input a timestep and the noisy radiology images that correspond to the reports \textbf{(c)}.
    During evaluation, noisy ground-truth images are repeatedly fed into the LDM to extract the corresponding cross-attention maps.
    These maps are processed according to the relevant tokens and brought to the correct dimensionality \textbf{(d)}.
    After a processing step, we obtain an activation map and its corresponding binary mask \textbf{(e)}.
    For BBM, we extract the image bias \textbf{(f)}, generate the text bias \textbf{(g)}, then merge the two and combine the result with the original activation map \textbf{(h)}.}
    \label{fig:method}
\end{figure}