
\section{Multi-Stage Probabilistic Framework}

The proposed training framework implements a three-stage cascading approach to address few-shot learning challenges. The generative model undergoes progressive refinement through out-of-domain (OOD) pre-training, in-domain (ID) fine-tuning, and target-domain (TD) adaptation. This hierarchical strategy enables diffusion models to perform effectively with limited training data by facilitating knowledge transfer across domains. The complete training pipeline is detailed in Algorithm~\ref{alg:training}, with each stage's methodology and rationale examined in the following sections.



% TODO: We have not said whether a tilde a is known or not, whether the original model has to be linear for our thing to work. It can be a nonlinear forward model, in which case this is a very big assumption. What is the potential limitation? 

\paragraph{Problem Formulation.} 

Following~\citep{song2021solving}, we formulate super-resolution as a linear inverse problem: recover an unknown high-resolution (HR) signal $\bm{x}$ from observed low-resolution (LR) measurements $\bm{y}$. Given scarce HR target data $\bm{x}_{\text{gt}}=\{x_i\}_{i=1}^M$, the corresponding LR target data $\bm{y}_{\text{gt}}=\{y_i\}_{i=1}^M$ are modeled as $\bm{y}_{\text{gt}} = \bm{A}\bm{x}_{\text{gt}}+\bm{\eta}$, where $\bm{A}$ denotes the linear downsampling matrix and $\bm{\eta}$ represents noise.
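
As a toy illustration of this measurement model, consider a length-4 signal and a hypothetical $2\times$ average-pooling operator standing in for $\bm{A}$ (chosen only for readability; it is not the bicubic operator), with $\bm{\eta}=\bm{0}$:
\begin{align*}
\bm{A} = \frac{1}{2}\begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}, \qquad
\bm{x} = (2, 4, 6, 8)^\top \;\Rightarrow\; \bm{y} = \bm{A}\bm{x} = (3, 7)^\top .
\end{align*}
Because $\bm{A}$ has more columns than rows, many HR signals map to the same $\bm{y}$; for instance, $(3, 3, 7, 7)^\top$ yields the same measurement. This ambiguity is why a learned prior over $\bm{x}$ is needed.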



To enable coarse-to-fine information flow, we incorporate two additional large-scale datasets: out-of-domain (OOD) data $\bm{x}_{\text{ood}}=\{x_i\}_{i=1}^{N_1}$ and in-domain (ID) data $\bm{x}_{\text{id}}=\{x_i\}_{i=1}^{N_2}$, where $N_1 > N_2 \gg M$. A bicubic degradation matrix $\tilde{\bm{A}}$ generates their LR counterparts: $\tilde{\bm{y}}_\text{ood} =\tilde{\bm{A}} \bm{x}_\text{ood}$ and $\tilde{\bm{y}}_\text{id} =\tilde{\bm{A}} \bm{x}_\text{id}$. 



\subsection{Training Stages } % We define OOD earlier; no need to restate it


\paragraph{Low-resolution Out-of-Domain Model Pre-Training.} This stage constructs a model for $p(\bm{x}_{\text{ood}}| \tilde{\bm{y}}_{\text{ood}})$ using abundant OOD data. We use COCO~\citep{COCO} images as $\bm{x}_{\text{ood}}$, together with their LR counterparts $\tilde{\bm{y}}_{\text{ood}}$, to learn coarse-grained features. SR3~\citep{SR3} serves as the backbone model, which keeps the framework general across applications.


\paragraph{Low-resolution In-Domain ControlNet Pre-Training.} The ID stage uses the IXI~\citep{IXI} brain MRI dataset for model adaptation. We generate the LR counterparts $\tilde{\bm{y}}_{\text{id}} = \tilde{\bm{A}} \bm{x}_{\text{id}}$ of the IXI images with the bicubic downsampling matrix $\tilde{\bm{A}}$. Rather than simply fine-tuning, we integrate ControlNet~\citep{controlnet} by connecting the pre-trained diffusion model's U-Net to zero-initialized convolutional layers (Fig.~\ref{fig:overview}), which lets the model acquire in-domain knowledge $p(\bm{x}_{\text{id}} | \tilde{\bm{y}}_{\text{id}})$ while preserving the OOD information.



\paragraph{High-resolution Target-Domain ControlNet Fine-Tuning.} To further fine-tune the system, the final stage aligns ControlNet~\citep{controlnet} with distribution $p(\bm{x}_{\text{gt}}| \bm{y}_{\text{gt}})$ using HR data $\bm{x}_{\text{gt}}$ (FastMRI~\citep{fastmri}, BrainTumor~\citep{braintumor}, OASIS~\citep{oasis}) and corresponding LR data $\bm{y}_{\text{gt}} = \bm{A}\bm{x}_{\text{gt}}$. Theoretically, $\bm{A}$ represents the true degradation process in medical imaging, which differs from the bicubic downsampling matrix $\tilde{\bm{A}}$ used in previous stages and is unknown to us. In practice, we approximate this true degradation by using bicubic downsampling to generate the target domain LR data $\bm{y}_{\text{gt}}$. Our experimental results demonstrate that this bicubic approximation achieves satisfactory performance in modeling the complex medical imaging degradation process.


\paragraph{ControlNet Integration.} We adopt ControlNet~\citep{controlnet} in the in-domain and target-domain stages to enable efficient domain adaptation while preserving pre-trained knowledge. ControlNet creates a trainable copy of the encoding layers of the pre-trained U-Net and connects it through zero-initialized convolutional layers. During training, the original U-Net weights remain frozen while the ControlNet branch learns domain-specific features. The outputs of both branches are combined via element-wise addition, so the model retains the general super-resolution capability learned in the OOD stage while acquiring medical imaging knowledge in the other two stages. This design stabilizes training and mitigates catastrophic forgetting of previously learned features.
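
In the notation of~\citep{controlnet} (the symbols below are borrowed from that work and are not used elsewhere in this paper), let $\mathcal{F}(\cdot;\Theta)$ be a frozen pre-trained block, $\Theta_c$ the parameters of its trainable copy, $\mathcal{Z}(\cdot;\cdot)$ a zero-initialized convolution, $\bm{h}$ the input feature map, and $\bm{c}$ the conditioning input. A sketch of the combined block output is
\begin{align*}
\bm{h}_c = \mathcal{F}(\bm{h};\Theta) + \mathcal{Z}\big(\mathcal{F}(\bm{h} + \mathcal{Z}(\bm{c};\Theta_{z1});\Theta_c);\Theta_{z2}\big).
\end{align*}
At initialization the weights and biases of both zero convolutions are zero, so $\mathcal{Z}(\cdot)=\bm{0}$ and $\bm{h}_c = \mathcal{F}(\bm{h};\Theta)$: training starts exactly from the behavior of the pre-trained model.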




\begin{figure*}[ht]
    \centering
    \includegraphics[width=0.95\textwidth]{./sec/fig/framework.png}
    \caption{Overview of the MSP-SR framework: a three-stage approach incorporating out-of-domain pre-training on COCO ($16 \to 64$), ControlNet-assisted in-domain fine-tuning on IXI ($16 \to 64$), and target-domain adaptation on medical datasets ($64 \to 256$). Each stage uses bicubic downsampling, with progressive resolution and domain transfer from natural to medical images.}
    \label{fig:overview}
\end{figure*}






\subsection{Conditional Generative Model (CGM)}



\paragraph{Gaussian Diffusion Process.}%%%%%%%%%%%%%%%


The backbone architecture is illustrated using the target-domain (TD) fine-tuning stage as an example. For an input image pair $\{\bm{x}:\bm{x}_{\text{gt}}, \bm{y}: \bm{y}_{\text{gt}}\}$, the model generates output $\{\bm{x}_0: \bm{x}_{\text{gt}}\}$ through the reverse diffusion process. The framework follows the conditional diffusion process and optimizes a neural denoising model that receives source image $\bm{y}$ and noisy target image $\bm{x}_t$ as inputs to produce the denoised image $\bm{x}_0$. 

Following DDPM~\citep{ddpm}, the unconditional diffusion process progressively adds Gaussian noise to the clean input $\bm{x}_0$ over $T$ iterations.


\begin{align}
  p( \bm{x}_{1:T} |  \bm{x}_0) &= \prod\nolimits_{t=1}^{T} p( \bm{x}_{t} |  \bm{x}_{t-1}) , \\
  p( \bm{x}_{t} |  \bm{x}_{t-1}) &= \mathcal{N}( \bm{x}_{t} | \sqrt{1-\beta_t}\,  \bm{x}_{t-1}, \beta_t \mathbf{I} ) . % y_{t}; 
\end{align}
where the parameters $\beta_{1:T}$ ($0 < \beta_t < 1$) determine the variance of the added noise. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{t'=1}^t \alpha_{t'}$. The noisy image $\bm{x}_t$ can then be expressed directly in terms of the original image $\bm{x}_0$:
 \begin{align}
     \bm{x}_t = \sqrt{\bar \alpha_t}\,  \bm{x}_0 + \sqrt{1-\bar \alpha_t} \, \bm{\epsilon}, \qquad
     \bm{\epsilon} \sim \mathcal{N}(\bm{0},\mathbf{I})~. \label{eq:fo}
 \end{align}
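
As a brief check of Equation~\ref{eq:fo}, the one-step transition can be unrolled twice, where $\bm{\epsilon}'_{t-1}$, $\bm{\epsilon}'_{t-2}$, and $\bar{\bm{\epsilon}}$ are auxiliary standard Gaussian variables introduced only for this sketch:
\begin{align*}
\bm{x}_t &= \sqrt{\alpha_t}\,\bm{x}_{t-1} + \sqrt{1-\alpha_t}\,\bm{\epsilon}'_{t-1} \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,\bm{x}_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\bm{\epsilon}'_{t-2} + \sqrt{1-\alpha_t}\,\bm{\epsilon}'_{t-1} \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,\bm{x}_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\bar{\bm{\epsilon}} .
\end{align*}
The last line merges the two independent Gaussian terms, whose variances add: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$. Repeating the argument down to $\bm{x}_0$ replaces $\alpha_t\alpha_{t-1}$ by $\bar\alpha_t$.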





% To estimate the $p(\bm{x}_0|\bm{x}_t)$ in equation \ref{eq:bayes}, 
Recovering $\bm{x}_0$ from a Gaussian noise input $\bm{x}_T$ enables the generation of new samples. Although $p(\bm{x}_{t-1} | \bm{x}_t)$ approximates a Gaussian distribution when the noise variance $\beta_t$ is sufficiently small, directly estimating $p(\bm{x}_{t-1} | \bm{x}_t)$ remains intractable. Instead, when conditioned on $\bm{x}_0$, the inverse conditional probability $p(\bm{x}_{t-1} | \bm{x}_t, \bm{x}_0)$ becomes tractable as follows:

\begin{align}
p(\bm{x}_{t-1}|\bm{x}_t,\bm{x}_0)=\mathcal{N}(\bm{x}_{t-1};\mu(\bm{x}_t,\bm{x}_0),\tilde \beta_t \mathbf{I}),
\label{eq:deonise}
\end{align}
where
\begin{align}
\mu(\bm{x}_t,\bm{x}_0)&=\frac{1}{\sqrt{\alpha_t}}\left(\bm{x}_t-\frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}} \bm{\epsilon}_t\right) \label{eq:mu}, \\
\tilde \beta_t &= \frac{1 - \bar \alpha_{t-1}}{1-\bar \alpha_t} \cdot \beta_t. \nonumber
\end{align}


Here, $\mu(\bm{x}_t, \bm{x}_0)$ is the mean of the Gaussian posterior given the noisy image $\bm{x}_t$ and the clean image $\bm{x}_0$, and $\bm{\epsilon}_t$ is the forward-process noise that produced $\bm{x}_t$ from $\bm{x}_0$ in Equation~\ref{eq:fo}; the dependence on $\bm{x}_0$ thus enters only through $\bm{\epsilon}_t$.
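
For completeness, the following sketch shows how Equation~\ref{eq:mu} follows from the posterior mean written in terms of $\bm{x}_0$, as derived in DDPM~\citep{ddpm}; this intermediate form is not used elsewhere in the paper:
\begin{align*}
\mu(\bm{x}_t,\bm{x}_0) &= \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\bm{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\bm{x}_t .
\end{align*}
Substituting $\bm{x}_0 = (\bm{x}_t - \sqrt{1-\bar\alpha_t}\,\bm{\epsilon}_t)/\sqrt{\bar\alpha_t}$ from Equation~\ref{eq:fo} and using $\bar\alpha_t = \alpha_t\bar\alpha_{t-1}$, the coefficient of $\bm{x}_t$ becomes
\begin{align*}
\frac{\beta_t}{\sqrt{\alpha_t}\,(1-\bar\alpha_t)} + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} = \frac{\beta_t + \alpha_t - \bar\alpha_t}{\sqrt{\alpha_t}\,(1-\bar\alpha_t)} = \frac{1}{\sqrt{\alpha_t}},
\end{align*}
and the coefficient of $\bm{\epsilon}_t$ becomes $-\beta_t / (\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t})$, which matches Equation~\ref{eq:mu} since $\beta_t = 1-\alpha_t$.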

Building upon this foundation, our framework extends to a conditional diffusion model that conditions on the low-resolution input $\bm{y}$. This conditioning provides crucial prior information during the early stages of the diffusion denoising process, guiding the generation towards semantically consistent outputs. The objective here is to learn a diffusion model $q_{\theta}$ to approximate the inverse conditional probability as follows~\citep{SR3}:
\begin{align}
q(\bm{x}_T)&=\mathcal{N}(\mathbf{0},\mathbf{I}), \\
q_{\theta}(\bm{x}_{t-1}|\bm{x}_t,\bm{y})&=\mathcal{N}(\bm{x}_{t-1};\mu_{\theta}(\bm{x}_t,t,\bm{y}),\tilde{\beta}_t \mathbf{I}), \\
q_{\theta}(\bm{x}_{0:T} |\bm{y})&=q(\bm{x}_T) \prod_{t=1}^T q_{\theta}(\bm{x}_{t-1}|\bm{x}_t,\bm{y}).
\end{align}

% Ideally, we want $\mu_{\theta}(\bm{x}_t,t,\bm{y})$ to output $\mu$ in Equation \ref{eq:mu}. Since $\bm{x}_t$ is known, we just need to predict the inner $\bm{\epsilon}_t$(noise). Therefore, $\mu_{\theta}(\bm{x}_t,t,\bm{y})$ is parameterized via te noise predictor $\bm{\epsilon}_{\theta}(\bm{x}_t,t,\bm{y})$ as

Ideally, we want $\mu_{\theta}(\bm{x}_t,t,\bm{y})$ to output the conditional equivalent of $\mu$ in Equation~\ref{eq:mu}. While Equation~\ref{eq:mu} describes the unconditional case, our conditional framework requires the mean to be guided by the conditioning input $\bm{y}$. Since $\bm{x}_t$ is known during inference, we only need to predict the noise term $\bm{\epsilon}_t$ conditioned on $\bm{y}$. Therefore, $\mu_{\theta}(\bm{x}_t,t,\bm{y})$ is parameterized via the conditional noise predictor $\bm{\epsilon}_{\theta}(\bm{x}_t,t,\bm{y})$ as


\begin{align}
\mu_{\theta}(\bm{x}_t,t,\bm{y})=\frac{1}{\sqrt{\alpha_t}}\left(\bm{x}_t-\frac{1-\alpha_t}{\sqrt{1 - \bar \alpha_t}}\bm{\epsilon}_{\theta}(\bm{x}_t,t, \bm{y})\right).
\label{eq:mu-eps}
\end{align}







\subsection{Model Consistency}


Drawing inspiration from the consistency model~\citep{song2023consistency}, we enhance input-output correspondence by introducing a consistency loss $l_\text{CON}$ alongside the standard reconstruction loss $l_\text{GT}$ used in diffusion models to train $\bm{\epsilon}_{\theta}$. Whereas consistency models enforce consistency across timesteps of the diffusion trajectory, our consistency loss enforces correspondence between the input low-resolution image and the generated high-resolution output through a degradation process.



Following~\citep{ddpm}, the reconstruction loss $l_\text{GT}$ trains the conditional model $\mu_{\theta}$ through its noise predictor $\bm{\epsilon}_{\theta}$ (Equation~\ref{eq:mu-eps}) by minimizing a variant of the ELBO, with the true image $\bm{x}_0$ and the conditioning $\bm{y}$ as inputs:



% To better match model outputs to their corresponding inputs, we draw inspiration from CycleGAN~\citep{cyclegan} and introduce an additional loss function, denoted as $l_\text{CON}$, to increase the model's consistency with target data. This is incorporated alongside the original reconstruction loss, denoted as $l_\text{GT}$, that is used in diffusion models to train $\bm{\epsilon}_{\theta}$. 

% Following existing literature~\citep{ddpm}, recall that the reconstruction loss function $l_\text{GT}$ learns the conditional model $\mu_{\theta}$, rather $\bm{\epsilon}_{\theta}$ based on \eqref{eq:mu-eps}, to minimize a variant
% of the ELBO with true image $x_0$ and the conditioning $y$ as inputs:
\begin{align}
l_\text{GT} = \mathbb{E}_{t, \bm{x}_0, \bm{\epsilon}} \left[\|\bm{\epsilon} - \bm{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_t}\,\bm{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}, t, \bm{y})\|^2 \right].  \label{eq:l_ori}
\end{align}

To motivate the consistency loss $l_\text{CON}$, consider a fixed timestep $t$. Let $\hat{\bm{x}}_{0,t}$ denote the estimate of the clean image recovered from the noisy image $\bm{x}_t$ at step $t$. Inverting Equation~\ref{eq:fo} with the predicted noise gives the one-step estimate

\begin{align}
\hat{\bm{x}}_{0,t} = \frac{\bm{x}_t - \sqrt{1-\bar \alpha_t}\,\hat{\bm{\epsilon}}}{\sqrt{\bar \alpha_t}}~,
\end{align}

where $\hat{\bm{\epsilon}}=\bm{\epsilon}_\theta(\bm{x}_t, t ,\bm{y})$ is the noise predicted at step $t$, and $\bm{x}_t$ is the noisy version of the target image $\bm{x}_{\text{gt}}$.


Downsampling $\hat{\bm{x}}_{0,t}$ with the bicubic operator $\tilde{\bm{A}}$ should then approximately reproduce the LR image $\bm{y}$, i.e., $\tilde{\bm{A}} \hat{\bm{x}}_{0,t} \approx \bm{y}$. Taking the expectation over $t$ yields the consistency loss, which is combined with $l_\text{GT}$ into the total loss of the diffusion model:
\begin{align}
l_\text{CON} = \mathbb{E}_t [ \| \tilde{\bm{A}}\bm{\hat{x}}_{0,t} - \bm{y} \|^2],\\
L= \gamma l_\text{GT} + (1-\gamma)l_\text{CON}.
\end{align}

where $\gamma$ is an adjustable hyperparameter, set to 0.5 in our experiments for convenience; a parameter search for its optimal value is left for future work.
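
To see what $l_\text{CON}$ penalizes, consider the idealized case, assumed here only for illustration, in which $\bm{y} = \tilde{\bm{A}}\bm{x}_0$ holds exactly with no noise. Substituting Equation~\ref{eq:fo} into the definition of $\hat{\bm{x}}_{0,t}$ gives
\begin{align*}
\hat{\bm{x}}_{0,t} - \bm{x}_0 &= \sqrt{\frac{1-\bar\alpha_t}{\bar\alpha_t}}\,\big(\bm{\epsilon} - \hat{\bm{\epsilon}}\big), \\
\| \tilde{\bm{A}}\hat{\bm{x}}_{0,t} - \bm{y} \|^2 &= \frac{1-\bar\alpha_t}{\bar\alpha_t}\,\big\| \tilde{\bm{A}}\big(\bm{\epsilon} - \hat{\bm{\epsilon}}\big) \big\|^2 .
\end{align*}
Under this assumption, $l_\text{CON}$ is a noise-prediction error measured in LR space and weighted by $(1-\bar\alpha_t)/\bar\alpha_t$. It vanishes when $\hat{\bm{\epsilon}} = \bm{\epsilon}$ and weights noisier timesteps (small $\bar\alpha_t$) more heavily than $l_\text{GT}$ does.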



\begin{algorithm}[H]
\caption{Training a Denoising Model $\mu_\theta$}\label{alg:training}
\begin{algorithmic}[1]
    \State \textbf{Input:} Datasets: $ D_{\text{ood}}(\bm{x}_{\text{ood}},\tilde{\bm{y}}_{\text{ood}} ),$ 
    \State \hspace{2.8em} $ D_{\text{id}}(\bm{x}_{\text{id}},\tilde{\bm{y}}_{\text{id}}),$ 
    \State \hspace{2.8em} $D_{\text{td}}(\bm{x}_{\text{gt}},\bm{y}_{\text{gt}})$
    \State \textbf{Output:} Trained model $\mu_\theta$
    \For{$d$ \textbf{in} Datasets}
         \State $\bm{x} = \bm{x}_{\text{ood}},\bm{y} = \tilde{\bm{y}}_{\text{ood}}$ \textbf{if} {$d = D_{\text{ood}}$}
         \State $\bm{x} = \bm{x}_{\text{id}},\bm{y} = \tilde{\bm{y}}_{\text{id}} $ \textbf{if} {$d = D_{\text{id}}$}
         \State $\bm{x} = \bm{x}_{\text{gt}},\bm{y} = \bm{y}_{\text{gt}}$ \textbf{if} {$d = D_{\text{td}}$}
        \While{not converged}
           \State $t \sim$ Uniform$(1,\ldots,T)$
           \State $(\bm{x}_0, \bm{y}) \sim p(\bm{x}, \bm{y})$
            \State $\bar \alpha_t = \prod_{s=1}^{t} \alpha_s$
            \State $\bm{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
            \State $\bm{x}_t = \sqrt{\bar \alpha_t}\, \bm{x}_0 + \sqrt{1 - \bar \alpha_t}\, \bm{\epsilon}$
            \State $\hat{\bm{\epsilon}} = \bm{\epsilon}_\theta(\bm{x}_t, t ,\bm{y})$
            \State $l_{\text{GT}} = \| \hat{\bm{\epsilon}} - \bm{\epsilon} \|^2$
            \State $\hat{\bm{x}}_{0,t} = \frac{\bm{x}_t - \sqrt{1-\bar \alpha_t}\,\hat{\bm{\epsilon}}}{\sqrt{\bar \alpha_t}}$
            \State $l_{\text{CON}} = \| \tilde{\bm{A}}\hat{\bm{x}}_{0,t} - \bm{y} \|^2$
            \State Take a gradient step on $\nabla_\theta [\gamma l_\text{GT} + (1-\gamma) l_\text{CON}]$ using the Adam optimizer
        \EndWhile
    \EndFor
\end{algorithmic}
\end{algorithm}




\begin{algorithm}[H]
\caption{Sampling $\mu_\theta$ in $T$ steps}
\begin{algorithmic}[1]
    \State \textbf{Input:} LR image $\bm{y}$
    \State $\bm{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
    \For{$t=T, \dotsc, 1$}
      \State $\bm{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, else $\bm{z} = \mathbf{0}$
      \State  $\bm{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left( \bm{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar \alpha_t}} \bm{\epsilon}_\theta(\bm{x}_t, t, \bm{y}) \right) + \sqrt{1 - \alpha_t}\, \bm{z}$
    \EndFor
    \State \textbf{return} $\bm{x}_0$
\end{algorithmic}
\end{algorithm}





