Keywords: deep learning, diffusion models, medical imaging, phrase grounding
Abstract: Localizing the exact pathological regions in
a given medical scan is an important imaging problem
that traditionally requires a large number of ground-truth
bounding box annotations to be solved accurately. However,
there exist alternative, potentially weaker, forms of supervision,
such as accompanying free-text reports, which are
readily available. The task of performing localization with
textual guidance is commonly referred to as phrase grounding.
In this work, we use a publicly available Foundation
Model, namely the Latent Diffusion Model, to perform this
challenging task. This choice is supported by the fact that
the Latent Diffusion Model, despite being generative in
nature, contains cross-attention mechanisms that implicitly
align visual and textual features, thus leading to intermediate
representations that are suitable for the task at hand. In
addition, we aim to perform this task in a zero-shot manner,
i.e., without any training on the target task, meaning that
the model’s weights remain frozen. To this end, we devise
strategies to select features and refine them via post-processing,
without introducing extra learnable parameters. We compare
our proposed method with state-of-the-art approaches
which explicitly enforce image-text alignment in a joint
embedding space via contrastive learning. Results on a
popular chest X-ray benchmark indicate that our method is
competitive with the state of the art across different pathology types,
and even outperforms it on average in terms of two metrics
(mean IoU and AUC-ROC). Source code will be released
upon acceptance at https://github.com/vios-s.
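
The abstract describes extracting cross-attention between text tokens and image latents from a frozen Latent Diffusion Model to localize a phrase. Below is a minimal, hypothetical sketch of that general idea, not the authors' implementation: it assumes cross-attention weights have already been captured from one layer of the denoising network, and uses an illustrative token-and-head averaging as the aggregation step; the actual feature-selection and post-processing strategies of the paper are not reproduced here.

```python
# Minimal sketch (assumptions only): turning captured cross-attention weights
# from a frozen text-conditioned diffusion model into a phrase-grounding heatmap.
import torch
import torch.nn.functional as F

def grounding_heatmap(attn_probs: torch.Tensor,
                      phrase_token_ids: list,
                      latent_hw: tuple,
                      out_hw: tuple) -> torch.Tensor:
    """
    attn_probs:       (heads, H*W, num_text_tokens) cross-attention weights
                      captured from one cross-attention layer at one step.
    phrase_token_ids: indices of the text tokens forming the query phrase.
    latent_hw:        spatial size (H, W) of the latent feature map.
    out_hw:           desired output resolution, e.g. the chest X-ray size.
    """
    h, w = latent_hw
    # Keep only the attention mass assigned to the phrase tokens,
    # then average over those tokens and over attention heads.
    phrase_attn = attn_probs[:, :, phrase_token_ids].mean(dim=-1)   # (heads, H*W)
    heatmap = phrase_attn.mean(dim=0).reshape(1, 1, h, w)           # (1, 1, H, W)
    # Upsample to image resolution and min-max normalise to [0, 1].
    heatmap = F.interpolate(heatmap, size=out_hw, mode="bilinear",
                            align_corners=False)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    return heatmap[0, 0]

# Toy usage with random tensors standing in for captured attention weights.
if __name__ == "__main__":
    attn = torch.rand(8, 16 * 16, 77).softmax(dim=-1)  # 8 heads, 16x16 latent, 77 tokens
    heat = grounding_heatmap(attn, phrase_token_ids=[4, 5, 6],
                             latent_hw=(16, 16), out_hw=(224, 224))
    print(heat.shape)  # torch.Size([224, 224])
```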
Submission Number: 85