\section{Introduction}
Understanding radiologists' eye gaze patterns is crucial to deciphering how disease patterns present spatially in radiological scans. This auxiliary signal, in the form of eye gaze maps, has recently been harnessed by deep learning systems for medical image diagnosis~\cite{bhattacharya2022radiotransformer,bhattacharya2022gazeradar}.
Medical experts dedicate years to honing their skills in diagnosing diseases from radiology images and learning to identify intricate disease patterns. With this experience, their visual-cognitive mechanisms adapt and improve over time, which in turn refines the way they look at scans~\cite{bertram2016eye,kelahan2019radiologist,tourassi2013investigating}. Hence, the visual patterns of an expert can provide critical sub-visual information for a deep learning model to improve its meta-understanding of a radiology image~\cite{stember2020integrating}.

\begin{figure}[htbp]
\floatconts
  {fig:teaser}
  {\caption{Three methods for generating chest X-rays (CXRs) from radiologists' transcripts. \textit{GazeDiff} (ours) generates more clinically accurate CXRs than the baselines.
  }}
  {\includegraphics[width=0.65\linewidth]{figures/midl2024_teaser.pdf}}
\end{figure}

Generative modeling has been studied extensively over the last decade, with a special focus on content generation. Diffusion models have recently brought a significant improvement on this front. They are likelihood-based generative models that model the data distribution via an iterative noising and denoising procedure~\cite{ho2020denoising} and achieve state-of-the-art performance in text-based image generation. Diffusion models have been further improved by introducing additional controls~\cite{zhang2023adding}. Furthermore, conditional generative models can be converted to classifiers~\cite{ng2001discriminative} and, similarly, text-to-image diffusion models can be used as zero-shot classifiers without any additional training~\cite{li2023your}. This is achieved by repeatedly adding noise to the input image, computing a Monte Carlo estimate of the expected noise reconstruction loss for every class in the dataset, and selecting the class with the lowest estimated loss.
Recent advances in controllable diffusion models allow text-to-image generation to be guided by additional controls. Such controls can be provided in two ways: a) training the diffusion model from scratch~\cite{huang2023composer}, and b) introducing light-weight adapters into a pretrained diffusion model~\cite{zhang2023adding,li2023gligen,mou2023t2i}. More recently, multiple controls have also been combined to generate more diverse images~\cite{zhang2023adding,zhao2023uni,qin2023unicontrol}. Deep learning models can recognize shapes better if they first learn to generate them~\cite{hinton2007recognize}. This idea has roots in psychology and suggests that generative modeling can play a crucial role in discriminative tasks such as classification. In medical image generation, clinically explainable conditions and text conditions are important for generating realistic radiology images; however, their use remains relatively unexplored.
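
The zero-shot classification procedure described above can be sketched as a simple decision rule. The notation here is assumed for illustration only and is not taken from the cited works verbatim: $x$ denotes the input image (or its latent encoding), $c_i$ the text condition for class $i$, $\epsilon_\theta$ the noise-prediction network, $\bar{\alpha}_t$ the cumulative noise schedule, and $N$ the number of Monte Carlo samples; we refer to~\cite{li2023your} for the precise formulation.
\[
  \hat{c} \;=\; \arg\min_{c_i} \; \frac{1}{N} \sum_{n=1}^{N} \left\| \epsilon_n - \epsilon_\theta\!\left(x_{t_n}, t_n, c_i\right) \right\|^2,
  \qquad
  x_{t_n} = \sqrt{\bar{\alpha}_{t_n}}\, x + \sqrt{1 - \bar{\alpha}_{t_n}}\, \epsilon_n ,
\]
where each $\epsilon_n \sim \mathcal{N}(0, I)$ and each timestep $t_n$ is sampled uniformly from the diffusion schedule. Under this sketch, the class whose condition best explains the added noise is chosen as the prediction, so no classifier-specific training is required.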

Radiologists' eye gaze patterns are strong clinical meta-features that are highly relevant to understanding disease patterns and the associated diagnoses. \textit{Can these eye gaze patterns serve as suitable controls for diffusion models?} In this work, we propose a novel approach that integrates this expert visual attention as an additional control for text-to-image diffusion models. Here, the text condition is the radiologist's transcript recorded while viewing an image; it contains disease-specific and context-rich information.
Our proposed architecture, \textit{GazeDiff}, utilizes these text conditions and visual attention maps as additional controls for
medical image generation (Figure~\ref{fig:teaser}).

Even though machine learning models can benefit from experts' eye gaze patterns, obtaining eye gaze in real-time decision-making scenarios is time-consuming, expensive, and often impractical. We address this problem by adapting \textit{GazeDiff} as a zero-shot classifier. Similar to~\cite{li2023your}, the gaze-conditioned stable diffusion model is used as a zero-shot classifier without any additional training.
We show that the proposed method outperforms the baselines in classifying both known and unknown classes.
In summary, we propose a novel gaze-guided zero-shot diffusion classifier, \textit{GazeDiff}, for pulmonary disease classification.\\
\textbf{Motivation and Overview.} Our work is motivated by the need to generate clinically accurate medical images. We hypothesize that the context-rich visuo-cognitive information in radiologists' eye gaze patterns can be used as a clinically relevant condition for image generation.
To do this, we first add the eye gaze patterns of experts as additional controls and radiologists' transcripts as text conditions to text-to-image diffusion models. We then show that this helps generate clinically accurate images. Finally, we use the finetuned diffusion model as a zero-shot classifier for downstream classification of pulmonary diseases such as pneumonia and tuberculosis.
The key contributions of this paper are as follows: a) we propose to add the eye gaze patterns of experts as an additional control and radiologists' text as the prompt to text-to-image diffusion models; b) we use the finetuned diffusion model as a zero-shot classifier for downstream classification of pulmonary diseases such as pneumonia and tuberculosis.

