\vspace{-.5in}
\section{Methodology} 
\figureref{fig:main} shows the pipeline of the \textit{GazeDiff} architecture. \textit{GazeDiff} consists of three major components: the Stable Diffusion (SD-CXR) model~\cite{rombach2022high} 
(shown in \textcolor{LightYellow}{yellow} and \textcolor{LightRed}{red}), the ControlNet-Gaze (CN-Gaze) model, which augments SD-CXR with radiologists' eye gaze patterns as additional controls (shown in \textcolor{LightGreen}{green}), and 
the class-conditioned density estimates computed from the 
CN-Gaze model
(shown in \textcolor{LightBlue}{blue}). 
First, we give a preliminary overview of diffusion models. In Section~\ref{gaze_as_control}, we propose a method that uses radiologists' eye gaze patterns as additional controls for text-to-image diffusion models, and in Section~\ref{zero_shot_classification}, we discuss a technique for using diffusion models as zero-shot classifiers.

\noindent\textbf{Preliminary.} Diffusion probabilistic models~\cite{ho2020denoising}, or diffusion models, are generative models defined by a parameterized Markov chain trained using variational inference. Given an input image $x=x_0$, the diffusion or forward process (shown in \textcolor{LightYellow}{yellow} in \figureref{fig:main}) is a fixed Markov process that gradually adds Gaussian noise $\epsilon\sim\mathcal{N}(0, I)$ to $x$ according to a variance schedule $\beta=\{\beta_1,\ldots,\beta_T\}$, shown as $q(x_{1:T}|x_0):=\prod_{t=1}^{T} q(x_t|x_{t-1})$, where $q(x_t|x_{t-1}):=\mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_t I)$. The reverse process (shown in \textcolor{LightRed}{red} in \figureref{fig:main}) is a chain of learned Gaussian transitions that denoises $x_T$ and can be conditioned on a variable $c$, shown as $p_{\theta}(x_{0:T}):=p(x_T)\prod_{t=1}^{T} p_{\theta}(x_{t-1}|x_t)$. In our case, $x$ is a CXR image, $c$ is the radiologist's findings (radiologist's transcripts and disease labels), and $T$ is the number of timesteps. Diffusion models thus define the density of $x_0$ conditioned on $c$ as $p_\theta(x_0|c)=\int_{x_{1:T}}p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t,c)\,dx_{1:T}$ with $p_\theta(x_{t-1}|x_t):=\mathcal{N}(x_{t-1};\mu_\theta(x_t, t), \sigma_\theta(x_t,t))$. The diffusion model is trained to maximize the \textit{variational lower bound} (ELBO) of the log-likelihood, defined as,
\begin{equation}
\label{first_equation}
    \log p_\theta(x_0|c)\geq\mathbb{E}_q\left[\log\frac{p_\theta(x_{0:T}, c)}{q(x_{1:T}|x_0)}\right]
\end{equation}
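\noindent As a brief aside, the forward process admits a well-known closed form that allows $x_t$ to be sampled directly from $x_0$ without simulating the whole chain. The cumulative coefficient $\bar\alpha_t$ below is notation introduced here for illustration only; under this reading, it appears to correspond to the coefficient written $\alpha_t$ in the noised-sample expression of Section~\ref{zero_shot_classification}.
\begin{equation*}
\begin{aligned}
    \bar\alpha_t&:=\prod_{s=1}^{t}(1-\beta_s),\qquad q(x_t|x_0)=\mathcal{N}\big(x_t;\sqrt{\bar\alpha_t}\,x_0,(1-\bar\alpha_t)I\big),\\
    x_t&=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon,\qquad \epsilon\sim\mathcal{N}(0,I).
\end{aligned}
\end{equation*}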
We first train SD on CXR images and term the resulting model SD-CXR. Then, radiologists view a CXR image $\mathcal{I}$ and generate eye gaze patterns $\mathbb{G}$, discussed in detail in Section~\ref{gaze_as_control}. Similar to~\cite{bhattacharya2022radiotransformer,bhattacharya2022gazeradar}, Human Visual Attention (HVA) maps are computed from $\mathbb{G}$. In this work, we compute separate Focal HVA and Global HVA maps. The focal HVA captures fine-grained disease-relevant features, while the global HVA captures coarse disease-relevant features; a weighted combination of these maps captures the entire feature space of disease-relevant regions, as discussed in detail in Appendix~\ref{appendix_hva}.
\subsection{Gaze as an additional control for Text-to-Image diffusion} 
\label{gaze_as_control}
Here, we discuss CN-Gaze, in which the radiologists' eye gaze patterns are injected as additional conditions into the SD-CXR model. Let the SD-CXR model be $\mathcal{F}(\cdot)$. We represent radiologists' eye gaze patterns as $\mathbb{G}_r\in \mathbb{R}^{(\mathbb{F}, \mathcal{T})}$, where $r\in\{1, 2, \ldots, \mathcal{R}\}$. Here, $\mathbb{F}$ are the eye gaze fixations over time $\mathcal{T}$, and $\mathcal{R}$ is the number of radiologists whose eye gaze is collected for the CXR image $\mathcal{I}\in\mathbb{R}^{H\times W\times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels of the image. HVA edge maps, denoted $\mathbb{G}'$, are then computed from $\mathbb{G}$ for the Global and Focal HVA (discussed in detail in Appendix~\ref{appendix_hva}). Similar to~\cite{zhang2023adding}, to train with the additional control $\mathbb{G}'$, we freeze $\mathcal{F}(\cdot)$ with its initial parameters $\Theta$ and clone these parameters into a \textit{trainable} copy, $\Theta_{\mathbb{G}'}$, which is trained with the gaze condition. The input $\mathcal{I}$ is fed to both $\mathcal{F}_\Theta(\cdot)$ and $\mathcal{F}_{\Theta_{\mathbb{G}'}}(\cdot)$, and the blocks of $\mathcal{F}_{\Theta_{\mathbb{G}'}}$ are connected to the blocks of $\mathcal{F}_\Theta$ through \textit{zero convolution} layers (denoted $\mathbb{Z}(\cdot)$; each is a $\mathrm{Conv}_{1\times1}$ layer whose weights and bias are initialized to zero), as shown in \figureref{fig:main}. The output is $y_\mathbb{G}=\mathcal{F}_\Theta(\mathcal{I})+\mathbb{Z}_2(\mathcal{F}_{\Theta_{\mathbb{G}'}}(\mathcal{I}+\mathbb{Z}_1(\mathbb{G}')))$. In our case, the radiologist's eye gaze patterns are represented as two separate entities, namely, Focal HVA and Global HVA. Hence, two separate ControlNets are trained, $\mathcal{F}_{\Theta_f}$ and $\mathcal{F}_{\Theta_g}$. 
The resulting outputs from these ControlNets are added, with no extra weighting or linear interpolation, to form a Multi-ControlNet $\mathcal{F}_{\Theta_{(f\oplus g)}}$. Hence, $y_\mathbb{G}$ can be represented as a weighted combination of the focal-conditioned output $y_f$ and the global-conditioned output $y_g$, shown as,
\begin{equation}
\label{second_equation}
    y_\mathbb{G}= \lambda_1 y_f+\lambda_2 y_g, \lambda_1, \lambda_2 \in \mathbb{R}^+,
    \begin{rcases}
        y_f=\mathcal{F}^1_\Theta(\mathcal{I})+\mathbb{Z}^1_2(\mathcal{F}^1_{\Theta_{\mathbb{G}_f}}(\mathcal{I}+\mathbb{Z}^1_1(\mathbb{G}_f)))\\
        y_g=\mathcal{F}^2_\Theta(\mathcal{I})+\mathbb{Z}^2_2(\mathcal{F}^2_{\Theta_{\mathbb{G}_g}}(\mathcal{I}+\mathbb{Z}^2_1(\mathbb{G}_g)))
    \end{rcases}
    \text{HVA}
\end{equation}
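\noindent To see why this construction does not perturb SD-CXR at the start of training, consider a short worked case. It assumes only what is stated above, namely that every zero convolution has zero weights and bias at initialization, so that $\mathbb{Z}(u)=0$ for any input $u$; the superscript ``init'' is a label introduced here for illustration.
\begin{equation*}
\begin{aligned}
    y_f^{\text{init}}&=\mathcal{F}^1_\Theta(\mathcal{I})+\underbrace{\mathbb{Z}^1_2(\cdot)}_{=0}=\mathcal{F}^1_\Theta(\mathcal{I}),\qquad
    y_g^{\text{init}}=\mathcal{F}^2_\Theta(\mathcal{I})+\underbrace{\mathbb{Z}^2_2(\cdot)}_{=0}=\mathcal{F}^2_\Theta(\mathcal{I}),\\
    y_\mathbb{G}^{\text{init}}&=\lambda_1\mathcal{F}^1_\Theta(\mathcal{I})+\lambda_2\mathcal{F}^2_\Theta(\mathcal{I}).
\end{aligned}
\end{equation*}
The gaze branches therefore contribute nothing until the zero-convolution parameters move away from zero during training.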
\subsection{Zero-Shot classification} 
\label{zero_shot_classification}
In common medical scenarios, radiologists' eye gaze patterns and transcripts are not available during real-time inference. Here, we discuss how CN-Gaze is used for zero-shot classification. Given each noised sample $x_t=\sqrt{\alpha_t}x+\sqrt{1-\alpha_t}\epsilon$, the diffusion model learns a noise predictor $\epsilon_\theta(x_t,c)$. Using this parameterization, the ELBO in \equationref{first_equation} can be rewritten as $-\mathbb{E}[\sum^T_{t=2}w_t\norm{\epsilon-\epsilon_\theta(x_t,c)}^2-\log p_\theta(x_0|x_1,c)]+C$, where $C$ is a constant that does not depend on $c$. Following~\cite{li2023your}, assuming $w_t=1$ and $\log p_\theta(x_0|x_1,c)\approx0$ since $T=1000$ is large, the simplified ELBO term for a CXR image $\mathcal{I}$ with noised version $\mathcal{I}_t$ is $-\mathbb{E}_{t,\epsilon}\left[\norm{\epsilon-\epsilon_\theta(\mathcal{I}_t,c)}^2\right]+C$.\\
Classification with generative models can be defined using Bayes' theorem as $p_\theta(c_i|\mathcal{I})=\frac{p(c_i)p_\theta(\mathcal{I}|c_i)}{\sum_jp(c_j)p_\theta(\mathcal{I}|c_j)}$, where $c_i$ is the label and $\mathcal{I}$ is the input image. Assuming a uniform prior over the labels, so that the $p(c_i)$ terms cancel, and substituting the simplified ELBO term for $\log p_\theta(\mathcal{I}|c_i)$, this can be rewritten as $p_\theta(c_i|\mathcal{I})=\frac{\exp\big(-\mathbb{E}_{t,\epsilon}\left[\norm{\epsilon-\epsilon_\theta(\mathcal{I}_t,c_i)}^2\right]\big)}{\sum_j\exp\big(-\mathbb{E}_{t,\epsilon}\left[\norm{\epsilon-\epsilon_\theta(\mathcal{I}_t,c_j)}^2\right]\big)}$. In our case, from \equationref{second_equation}, $p_\theta(c_i|\mathcal{I})$ can be rewritten as $p_\Theta(c_{(f\oplus g)_i}|\mathcal{I})$, shown as
\begin{equation}
\label{third_equation}
    p_{\Theta}(c_{(f\oplus g)_i}|\mathcal{I}) = \frac{\exp\bigg(-\mathbb{E}_{t,\epsilon}\left[\norm{\epsilon-\epsilon_{\Theta_{(f\oplus g)}}(\mathcal{I}_t,c_{(f\oplus g)_i})}^2\right]\bigg)}{\sum_j\exp\bigg(-\mathbb{E}_{t,\epsilon}\left[\norm{\epsilon-\epsilon_{\Theta_{(f\oplus g)}}(\mathcal{I}_t,c_{(f\oplus g)_j})}^2\right]\bigg)}
\end{equation}
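\noindent The step from Bayes' theorem to this softmax form can be spelled out as follows. The sketch assumes a uniform prior $p(c_i)=1/K$ over $K$ candidate labels ($K$ is notation introduced here), treats the ELBO as an approximation to the log-likelihood, and uses the shorthand $\ell_i:=\mathbb{E}_{t,\epsilon}[\norm{\epsilon-\epsilon_\theta(\mathcal{I}_t,c_i)}^2]$, so that $\log p_\theta(\mathcal{I}|c_i)\approx-\ell_i+C$ with $C$ not depending on the label.
\begin{equation*}
    p_\theta(c_i|\mathcal{I})=\frac{\frac{1}{K}\,p_\theta(\mathcal{I}|c_i)}{\sum_j\frac{1}{K}\,p_\theta(\mathcal{I}|c_j)}\approx\frac{e^{C}\exp(-\ell_i)}{e^{C}\sum_j\exp(-\ell_j)}=\frac{\exp(-\ell_i)}{\sum_j\exp(-\ell_j)}
\end{equation*}
Both the prior and the constant $C$ cancel, so the predicted label is the one with the smallest expected noise-prediction error.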
Then, an unbiased Monte Carlo estimate of each expectation is computed by sampling $N$ pairs $(t_i, \epsilon_i)$, shown as $\frac{1}{N}\sum_{i=1}^{N}\norm{\epsilon_i -\epsilon_\Theta(\sqrt{\alpha_{t_i}}\mathcal{I}+\sqrt{1-\alpha_{t_i}}\epsilon_i, c_j)}^2$. Plugging this estimate into \equationref{third_equation} yields the zero-shot \textit{GazeDiff} classifier.
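\noindent As a purely illustrative calculation, suppose there are two candidate labels and the Monte Carlo estimates come out as $\hat\ell_1=0.10$ and $\hat\ell_2=0.12$. These values and the symbol $\hat\ell_j$ are hypothetical, chosen only to show the arithmetic; they are not experimental results.
\begin{equation*}
    p(c_1|\mathcal{I})\approx\frac{e^{-0.10}}{e^{-0.10}+e^{-0.12}}=\frac{1}{1+e^{-0.02}}\approx0.505,\qquad p(c_2|\mathcal{I})\approx0.495.
\end{equation*}
The posterior is close to uniform because the two errors differ only slightly, which suggests that $N$ should be large enough for the estimates to resolve such small gaps reliably.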
