\vspace{-.2in}
\section{Experiments and Results}
\subsection{Datasets and Experiments} We use radiologists' eye gaze and the corresponding transcriptions from the publicly available Eye Gaze Data for Chest X-rays~\cite{karargyris2021creation,Karargyris2020} (n=1083). For zero-shot classification, we show results on two pneumonia classification datasets and two tuberculosis classification datasets. For pneumonia classification, we use the test set of the publicly available Cell Pneumonia dataset~\cite{kermany2018identifying} and the RSNA Pneumonia Detection Challenge dataset~\cite{shih2019augmenting}. For tuberculosis classification, we use the NLM MCU~\cite{candemir2013lung,jaeger2013automatic,jaeger2014two} and CHN~\cite{jaeger2014two} datasets, which were obtained from Montgomery County, Maryland, USA and Shenzhen No. 3 People's Hospital in China, respectively. The REFLACX dataset~\cite{bigolin2022reflacx,lanfredi2021reflacx} is used to evaluate the quality of the generated images.\\
For finetuning, we train SD v1.5 for $15{,}000$ steps with an image size of $512\times512$ on one Quadro RTX 8000 (48 GB) GPU, with a batch size of $4$ and a learning rate of $1\times10^{-5}$. With this finetuned SD as the base, we train a ControlNet model with HVA edge maps. Here we compute Canny edges~\cite{canny1986computational} of the global and focal HVA maps separately, as shown in \figureref{fig:ablation}. This training was performed for $10$ epochs with a batch size of $2$ and a learning rate of $1\times10^{-5}$. During inference, we combine the global and focal ControlNets and sample with the UniPCMultistepScheduler~\cite{zhao2023unipc}. The number of time steps is set to $50$ and the conditioning scale is set to $2.5$. We use the $\ell_2$ norm for the $\epsilon$-prediction error.
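\par\noindent As an illustrative sketch only, and assuming the zero-shot classifier follows the standard diffusion-classifier decision rule, the $\ell_2$ $\epsilon$-prediction error can be turned into a class prediction as follows. The symbols $x_0$, $x_{t_i}$, $t_i$, $\epsilon_i$, $c$, $\hat{c}$, $\bar{\alpha}_{t}$ and $N$ are notation introduced here for illustration and may differ from the notation used elsewhere in the paper:
\[
\hat{c} \;=\; \arg\min_{c}\; \frac{1}{N}\sum_{i=1}^{N} \bigl\| \epsilon_i - \epsilon_\theta(x_{t_i}, t_i, c) \bigr\|_2^2,
\qquad
x_{t_i} = \sqrt{\bar{\alpha}_{t_i}}\, x_0 + \sqrt{1-\bar{\alpha}_{t_i}}\, \epsilon_i .
\]
Here $x_0$ denotes the (latent) input CXR, $\epsilon_i$ is a standard Gaussian noise sample, $t_i$ is a sampled time step, $c$ ranges over the candidate class prompts, and $\epsilon_\theta$ is the conditioned denoiser. Under this assumed rule, the class whose prompt yields the smallest average noise-prediction error is returned as the label.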
\vspace{-.15in}
\subsection{Quantitative Results}
We show results on a known class, i.e. pneumonia classification, for which samples were present when training SD-CXR and CN-Gaze, and on an unknown class, i.e. tuberculosis classification, for which no samples were present when training SD-CXR and CN-Gaze.\\
\input{tables/table_1}
\textbf{Evaluation of Zero-shot Classification Performance.} \textit{GazeDiff} is compared against standard diffusion models such as SD~\cite{ho2020denoising} and ControlNet~\cite{zhang2023adding} trained on natural images, and against the same models finetuned on CXRs (i.e. SD-CXR and CN-CXR). We also evaluate the performance of \textit{GazeDiff} against RoentGen~\cite{chambon2022roentgen}, a baseline SD model finetuned with CXR images. To measure the classification performance, we report Accuracy ($\uparrow$) and F1-score ($\uparrow$) in \tableref{tab:quantitative}, and we report additional metrics in Appendix \tableref{tab:quantitative_appendix}. \textit{GazeDiff} matches or outperforms the baselines on all 4 benchmark datasets for pneumonia and tuberculosis classification. We observe that \textit{GazeDiff} outperforms the finetuned SD model by \textbf{0.62$\pm$0.48\%} [Cell: 1.35\%, RSNA: 0.60\%, CHN: 0.51\%, MCU: \textit{no improvement}], the finetuned ControlNet model by \textbf{1.15$\pm$0.65\%} [Cell: 2.16\%, RSNA: 0.66\%, CHN: 0.51\%, MCU: 1.27\%], and RoentGen by \textbf{15.60$\pm$4.55\%} [Cell: 8.90\%, RSNA: 17.13\%, CHN: 14.87\%, MCU: 21.51\%]. We also show comparisons with CLIP~\cite{radford2021learning} and PubMedCLIP~\cite{eslami2021does}.\\
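As a worked check of the aggregate gains above, we assume that each aggregate is the mean and the population standard deviation (normalized by $4$, not $3$) of the four per-dataset gains, and that \textit{no improvement} counts as $0$; both assumptions are inferred from the reported values. For the finetuned SD model, with gains $g = (1.35,\, 0.60,\, 0.51,\, 0.00)$:
\[
\bar{g} = \frac{1.35 + 0.60 + 0.51 + 0.00}{4} = 0.615 \approx 0.62,
\qquad
\sigma = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\left(g_i - \bar{g}\right)^2} = \sqrt{\frac{0.9297}{4}} \approx 0.48 .
\]
The same computation reproduces $1.15\pm0.65$ for the finetuned ControlNet model and $15.60\pm4.55$ for RoentGen.\\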
\input{tables/table_3}
\textbf{Evaluation of Image Quality.} In \tableref{tab:qualitative}, we report FID ($\downarrow$) and CLIP-score ($\uparrow$) to evaluate the quality of the images generated by \textit{GazeDiff} and compare it with ControlNet. We show that \textit{GazeDiff} outperforms ControlNet on 4 pulmonary disease types. Additional results are reported in Appendix \tableref{tab:qualitative_appendix_1} and \tableref{tab:qualitative_appendix_2}.

\subsection{Ablation Analysis}
\begin{figure}[htbp]
\floatconts
  {fig:ablation}
  {\caption{\textbf{Ablation Analysis.} (a.*) images for focal HVA computations. (b.*) images for global HVA computations. (*.1) raw fixations overlaid on the CXR. (*.2) the HVA map. (*.3) the Canny edge map. (*.4) the CXR generated by \textit{GazeDiff}. The \textcolor{red}{red arrows} show disease patterns generated in the HVA regions.
  }}
  {\includegraphics[width=1.0\linewidth]{figures/midl2024_ablation.pdf}}
\end{figure}
In \tableref{tab:ablation}, we show the performance of \textit{GazeDiff} when trained with different human visual attention (HVA) maps. From a radiologist's eye gaze patterns, we calculate two different visual attention patterns, namely global attention and focal attention, described in detail in Appendix \ref{appendix_hva}. Here, we observe that the ControlNet finetuned with the combined global and focal attention generates a better noise representation for zero-shot classification and outperforms the Global model by \textbf{1.35\%} and the Focal model by \textbf{1.62\%}.
\input{tables/table_2}
In \figureref{fig:ablation}, we show the generated CXRs for the different HVA Canny edge maps. Here, we observe that the generated images show distinct irregularities at the locations of the Canny edges of the HVA maps. This suggests that the model interprets the semantic content of the experts' eye gaze robustly for medical image generation.

\subsection{Qualitative Results}
\begin{figure}[htbp]
\floatconts
  {fig:qualitative_1}
  {\caption{\textbf{Qualitative Results.} We show the CXRs generated by \textit{GazeDiff} with the radiologists' transcripts as text conditions. The generated pathology/object mentioned in each transcript is marked with a \textcolor{red}{red box}.}}
  {\includegraphics[width=1.0\linewidth]{figures/midl2024_qualitative.pdf}}
\end{figure}
In \figureref{fig:qualitative_2}, we compare the CXRs generated by \textit{GazeDiff} with those of different baselines, namely Stable Diffusion, ControlNet, and RoentGen, using pneumonia and tuberculosis disease names as text conditions. Here, we observe that \textit{GazeDiff} generates more realistic disease patterns than the baselines. We also show the location of the generated disease patterns annotated by a radiologist (7 years of experience). In \figureref{fig:qualitative_1}, we show the CXRs generated by the proposed method for different transcriptions as text conditions. We show the location of the generated disease patterns annotated in red. \textit{GazeDiff} not only generates the disease patterns/irregularities and devices mentioned in the transcript (text highlighted in \textcolor{red}{red}) but also generates them in the mentioned location (text highlighted in \textcolor{SkyBlue}{blue}).
\begin{figure}[htbp]
\floatconts
  {fig:qualitative_2}
  {\caption{\textbf{Qualitative Comparisons.} We compare the CXRs generated by \textit{GazeDiff} with those of the baselines for a class-conditioned prompt. \textcolor{red}{Red arrows/bounding boxes} show the generated pathology.
  }}
  {\includegraphics[width=0.9\linewidth]{figures/midl2024_2.pdf}}
\end{figure}