Phrase grounding, \emph{i.e.}, mapping natural language phrases to specific image regions, holds significant potential for localizing diseases in medical images based on the corresponding clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models can achieve superior zero-shot phrase grounding performance through their cross-attention maps.
Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms fine-tuning with a domain-agnostic one. This setup achieves mIoU scores roughly twice as high as those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks.
To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at \url{https://github.com/Felix-012/generate_to_ground}.