



% \newpage
% \section{Appendix}

In the supplementary material, we present additional qualitative results for the MSP-SR framework. We also include detailed information on the model architecture and comprehensive descriptions of the datasets used.



\section{Dataset Description}
%%intro the MRI dataset?



% detailly introduce them, include the content, the MRI image type. maybe data sample?






\paragraph{OOD Pre-Training Stage.}
The initial stage utilizes the COCO dataset~\cite{COCO}, a comprehensive collection of over 330,000 images, including 200,000 labeled images across 80 object categories. We selected 100,000 images for OOD pre-training, with each image cropped to a square format based on its smaller dimension.


\paragraph{ID Fine-tuning Stage.}
The second stage employs the IXI dataset~\cite{IXI}, which contains brain MRI scans from London hospitals. While the dataset includes T1, T2, PD-weighted, diffusion-weighted, and MR angiography images, we specifically utilize T2-weighted longitudinal and transversal brain scans for fine-tuning.



\paragraph{TD Fine-tuning Stage.}
The final stage incorporates three target-domain datasets: FastMRI~\cite{fastmri}, BrainTumor~\cite{braintumor}, and OASIS~\cite{oasis}. FastMRI, developed by Facebook AI Research and NYU Langone Health, provides k-space data and high-resolution reconstructed images, from which we use T2-weighted longitudinal scans. The BrainTumor dataset contains multi-modal MRI scans (T1, T2, FLAIR) of brain tumor patients with tumor type annotations; we utilize its T1/T2-weighted longitudinal scans. From the OASIS-2 longitudinal Alzheimer's study dataset, we select MRI scans from cognitively stable subjects as our target data.



\section{Training and Evaluation Details}

%gpu type 
Our training process consists of three stages: initial training on the out-of-domain (COCO) dataset, followed by fine-tuning on the in-domain (IXI) dataset and then on the target-domain datasets (FastMRI, BrainTumor, OASIS). Input resolutions are set to $64\times64$ for the COCO and IXI datasets and $256\times256$ for the target-domain datasets. All experiments are conducted on an NVIDIA A100-SXM4-40GB GPU with CUDA 12.2. The OOD pre-training and ID fine-tuning stages each run for 1 million iterations, taking approximately 24 hours per stage on a single A100 GPU, and the best-performing epoch is selected for subsequent analysis. Target-domain fine-tuning continues for 60k--80k iterations until convergence. To maintain consistency with the baseline architecture, we adopt training hyperparameters similar to those in SR3~\cite{SR3}, as detailed in Table~\ref{tab:training_config}.


%training data .?
%add a training hyparameter table

% batchsize
% lr
% optimizer
% iteration

\begin{table*}[ht]
\centering
\begin{tabular}{c@{\hspace{0.3cm}}|c@{\hspace{0.25cm}}|c@{\hspace{0.3cm}}|c@{\hspace{0.2cm}}|c@{\hspace{0.25cm}}|c@{\hspace{0.3cm}}|c@{\hspace{0.15cm}}}
\hline
\textbf{Dataset} & \textbf{Batch size} & \textbf{Iterations} & \textbf{LR} & \textbf{Dropout} & \textbf{Resolution} & \textbf{Opt.} \\ \hline
COCO & \multirow{5}{*}{4} & 1000000 & \multirow{5}{*}{1e-4} & \multirow{5}{*}{0.2} & \multirow{2}{*}{16 $\to$ 64} & \multirow{5}{*}{Adam} \\ \cline{1-1} \cline{3-3}
IXI &  & 1000000 &  &  &  &  \\ \cline{1-1} \cline{3-3} \cline{6-6}
FastMRI &  & 70000 &  &  & \multirow{3}{*}{64 $\to$ 256} &  \\ \cline{1-1} \cline{3-3}
BrainTumor &  & 70000 &  &  &  &  \\ \cline{1-1} \cline{3-3}
OASIS &  & 70000 &  &  &  &  \\ \hline
\end{tabular}
\caption{Training Configuration for Various Datasets for 4$\times$ scale SR.}
\label{tab:training_config}
\end{table*}





%%%?
%introduce LPIPS detail
\paragraph{LPIPS.} We evaluate the perceptual quality of generated super-resolution images using Learned Perceptual Image Patch Similarity (LPIPS)~\cite{lpips}, a neural network-based metric that captures perceptually meaningful image differences more effectively than traditional pixel-based measures. Our implementation uses the VGG backbone for LPIPS computation, and we evaluate on 60 test samples from each target dataset.
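
For reference, the LPIPS distance between a super-resolved image $x$ and its ground truth $x_0$ can be written, following the definition in~\cite{lpips}, as
\[
d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^{l}_{hw} - \hat{y}^{l}_{0,hw} \right) \right\|_2^2 ,
\]
where $\hat{y}^{l}$ and $\hat{y}^{l}_{0}$ are the channel-wise unit-normalized activations of backbone layer $l$ (of spatial size $H_l \times W_l$) for $x$ and $x_0$, and $w_l$ is a learned per-channel weight vector. The symbols are introduced here only for illustration; lower values indicate higher perceptual similarity.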



\section{Additional Experiment Results}



\subsection{Additional Visual Results for OASIS and BrainTumor}
Table~\ref{tab:main_exp} presents performance comparisons between MSP-SR and its ablated variants on the FastMRI dataset. Supplementary visual results for these experiments are provided in Figures~\ref{fig:app_2x}, \ref{fig:app_4x}, and \ref{fig:app_8x}. Additionally, Figure~\ref{fig:other_dataset} shows visual results on the OASIS~\cite{oasis} and BrainTumor~\cite{braintumor} datasets, complementing the quantitative analysis in Table~\ref{table:other_dataset}.


\subsection{Uncertainty Quantification Analysis}
Figure~\ref{fig:zoomin} presents the complete sample used for uncertainty analysis in Figures~\ref{fig:heatmap1} and \ref{fig:heatmap2}. The visualization consists of two components: a ground truth MRI image with a highlighted region of interest (left), and sets of inference samples from three training configurations (right). These configurations include: (1) OOD pre-training + ID fine-tuning + TD fine-tuning, (2) OOD pre-training + TD fine-tuning, and (3) TD-only training, each represented by five inference samples.
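
As an illustrative way to summarize such repeated inferences (an assumption made here for exposition, not necessarily the exact statistic behind the heatmaps), the per-pixel spread over the $N=5$ samples $\hat{x}^{(1)},\dots,\hat{x}^{(N)}$ of one configuration can be measured by the sample standard deviation
\[
\mu(p) = \frac{1}{N}\sum_{n=1}^{N} \hat{x}^{(n)}(p), \qquad
s(p) = \sqrt{\frac{1}{N-1}\sum_{n=1}^{N}\left(\hat{x}^{(n)}(p)-\mu(p)\right)^{2}} ,
\]
where $p$ indexes pixels. A larger $s(p)$ indicates stronger disagreement among the samples at that location.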


\begin{figure*}[ht]
  \centering
  \includegraphics[width=\textwidth]{./sec/fig/ap_zoomin.png}
  \caption{Visualization of Multiple 4$\times$ Inference Results Across Different Training Configurations.}
  \label{fig:zoomin}
\end{figure*}





%%results for 2x,4x,8x framework

\begin{figure*}[ht]
  \centering
  \includegraphics[width=\textwidth]{./sec/fig/app_2x.png}
  \caption{More visualized samples for different frameworks on 2$\times$ scale SR task.}
  \label{fig:app_2x}
\end{figure*}


\begin{figure*}[ht]
  \centering
  \includegraphics[width=0.9\textwidth]{./sec/fig/app_4x.png}
  \caption{More visualized samples for different frameworks on 4$\times$ scale SR task.}
  \label{fig:app_4x}
\end{figure*}


\begin{figure*}[ht]
  \centering
  \includegraphics[width=0.9\textwidth]{./sec/fig/app_8x.png}
  \caption{More visualized samples for different frameworks on 8$\times$ scale SR task.}
  \label{fig:app_8x}
\end{figure*}

\begin{figure*}[ht]
  \centering
  \includegraphics[width=0.9\textwidth]{./sec/fig/other_dataset.png}
  \caption{Visualized samples for experiments on OASIS and BrainTumor datasets.}
  \label{fig:other_dataset}
\end{figure*}



\section{Model Architecture}
In this section, we present our detailed model architecture. We use the same encoder-decoder architecture for the denoising U-Net in all three stages. We build our model on top of SR3~\cite{SR3} and ControlNet~\cite{controlnet}. Our model contains three components: the encoder, the decoder, and the ControlNet.

% We modified the model architecture from VQGAN\cite{VQGAN}. To make sure our input and output resolution of the BEV map is $200 \times 200$, and the latent space resolution is $12 \times 12$, we adjust the padding strategy in the auto-encoder model.

\paragraph{Encoder}
For an input noise level $t$, the encoder uses positional encoding and two MLP layers to produce an embedding of the noise level. The encoder contains an input convolution layer followed by five down-sampling blocks, each containing two ResNet blocks. One of the down-sampling blocks also contains self-attention. After down-sampling, the encoder has two middle ResNet blocks and one middle attention block. The detailed architecture is given in Tab.~\ref{tab:Encoder}.
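
The noise-level embedding can be restated compactly as follows. This is a sketch under an assumption made here for illustration and not stated in the table: the 64-channel encoding is taken to consist of 32 sine and 32 cosine terms with normalized frequency index $s_i = i/32$, $i = 0,\dots,31$. Then
\[
\gamma(t) = \left[\sin(t\,\omega_0),\dots,\sin(t\,\omega_{31}),\ \cos(t\,\omega_0),\dots,\cos(t\,\omega_{31})\right],
\qquad \omega_i = \exp\left(-\ln(10^{4})\, s_i\right),
\]
\[
e(t) = W_2\,\mathrm{Swish}\left(W_1\,\gamma(t) + b_1\right) + b_2,
\qquad \mathrm{Swish}(x) = x \cdot \mathrm{sigmoid}(x),
\]
where $W_1$ maps 64 to 256 channels and $W_2$ maps 256 back to 64 channels, matching the two MLP layers; the symbols $\gamma$, $\omega_i$, $W_1$, $W_2$, $b_1$, $b_2$ are hypothetical names used only in this restatement.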

\begin{table*}[!t]
\caption{Architecture for Encoder}
\resizebox{1.0\textwidth}{!}{
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
\multicolumn{2}{|c|}{layers} & \multicolumn{5}{|c|}{parameters} \\
\hline

\multirow{4}*{noise\_level\_t} & PositionalEncoding & \multicolumn{5}{|c|}{ t = t * exp(-log(1e4) * arange(64)); encoding = (sin(t), cos(t)) } \\
~ & mlp & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 256} \\
~ & Swish activation & \multicolumn{5}{|c|}{x*sigmoid(x)} \\
~ & mlp & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 64} \\

 \hline
\multirow{1}*{input} & Conv2d & \multicolumn{5}{|c|}{in\_ch:6, out\_ch: 64, kernel: 3x3, stride: 1, pad: 1 } \\
 \hline
\multirow{3}*{downsample\_block\_1} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 64 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 64 } \\
~ &  Downsample(Conv2d) & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 64,kernel:3x3, stride:2, padding=((0,1,0,1),val=0) } \\
\hline
\multirow{3}*{downsample\_block\_2} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 128 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 128 } \\
~ &  Downsample(Conv2d) & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 128,kernel:3x3, stride:2, padding=((0,1,0,1),val=0) } \\
\hline
\multirow{3}*{downsample\_block\_3} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 256 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 256 } \\
~ &  Downsample(Conv2d) & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 256,kernel:3x3, stride:2, padding=((0,1,0,1),val=0) } \\
\hline
\multirow{5}*{downsample\_block\_4} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 512 } \\
~ &  SelfAtt & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  SelfAtt & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  Downsample(Conv2d) & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512,kernel:3x3, stride:2, padding=((0,1,0,1),val=0) } \\
\hline
\multirow{2}*{downsample\_block\_5} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\

\hline
\multirow{3}*{middle} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  AttnBlock & \multicolumn{5}{|c|}{in\_ch:512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
\hline
% \multirow{3}*{end} &  Normalize & \multicolumn{5}{|c|}{GroupNorm,num\_groups=32, num\_channels=512} \\

% ~ &  Activation & \multicolumn{5}{|c|}{x*sigmoid(x)} \\

% ~ &  Conv2d & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 256, kernel: 3x3, stride: 1, pad: 1 } \\
% \hline
\end{tabular}
}
\label{tab:Encoder}
\end{table*}
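
The spatial sizes through the encoder follow from the standard convolution output-size formula. As a worked check, assuming the padding tuple $(0,1,0,1)$ adds one zero-valued pixel on one side of each spatial dimension, a $3\times3$ convolution with stride 2 maps an input of even side length $H$ to
\[
H_{\mathrm{out}} = \left\lfloor \frac{H + 1 - 3}{2} \right\rfloor + 1 = \frac{H}{2} ,
\]
so the four strided convolutions reduce a $64\times64$ input as $64 \to 32 \to 16 \to 8 \to 4$ and a $256\times256$ input as $256 \to 128 \to 64 \to 32 \to 16$.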

\paragraph{Decoder}
The decoder mirrors the structure of the encoder. It consists of five upsampling blocks, each containing three ResNet blocks; in one of these blocks, every ResNet block is followed by self-attention. The decoder ends with an output convolution layer. The detailed architecture is given in Tab.~\ref{tab:decoder}.


\paragraph{ControlNet}
Following ControlNet~\cite{controlnet}, we add a branch to the network for better fine-tuning. The ControlNet branch uses exactly the same architecture as the encoder, except for additional zero-convolution blocks. Its initial weights are copied from the encoder, and the zero-convolution layers are initialized to output all zeros. The output of each zero convolution in the ControlNet branch is added to the input of the corresponding upsampling block in the decoder. The detailed architecture is given in Tab.~\ref{tab:ControlNet}.
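
The role of the zero convolutions can be summarized in a single relation, written in the spirit of~\cite{controlnet} with symbols introduced here only for illustration:
\[
h_k' = h_k + \mathcal{Z}_k\left(c_k\right),
\]
where $h_k$ is the original input to the $k$-th upsampling block, $c_k$ is the corresponding feature map from the ControlNet branch, and $\mathcal{Z}_k$ is a convolution whose weights and bias are initialized to zero. At the start of fine-tuning $\mathcal{Z}_k(c_k) = 0$, hence $h_k' = h_k$ and the network reproduces the output it would give without the ControlNet branch; the branch contributes only as $\mathcal{Z}_k$ is updated during training.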


\begin{table*}[!t]
\caption{Architecture for Decoder}
\resizebox{1.0\textwidth}{!}{
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
\multicolumn{2}{|c|}{layers} & \multicolumn{5}{|c|}{parameters} \\
% \hline
% \multirow{1}*{input} & Conv2d & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 512, kernel: 3x3, stride: 1, pad: 1 } \\

% \hline
% \multirow{3}*{middle} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
% ~ &  AttnBlock & \multicolumn{5}{|c|}{in\_ch:512 } \\
% ~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
\hline


\multirow{5}*{upsample\_block\_1} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:1024, out\_ch: 512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  Upsample(nearest\_interpolate) & \multicolumn{5}{|c|}{scale\_factor=2.0 } \\
~ &  Conv2d & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512, kernel=3x3, stride=1, padding=1 } \\
\hline
\multirow{8}*{upsample\_block\_2} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:1024, out\_ch: 512 } \\
~ &  SelfAtt & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  SelfAtt & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  SelfAtt & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  Upsample(nearest\_interpolate) & \multicolumn{5}{|c|}{scale\_factor=2.0 } \\
~ &  Conv2d & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512, kernel=3x3, stride=1, padding=1 } \\

\hline
\multirow{5}*{upsample\_block\_3} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 256 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 256 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 256 } \\
~ &  Upsample(nearest\_interpolate) & \multicolumn{5}{|c|}{scale\_factor=2.0 } \\
~ &  Conv2d & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 256, kernel=3x3, stride=1, padding=1 } \\

\hline
\multirow{5}*{upsample\_block\_4} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 128 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 128 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 128 } \\
~ &  Upsample(nearest\_interpolate) & \multicolumn{5}{|c|}{scale\_factor=2.0 } \\
~ &  Conv2d & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 128, kernel=3x3, stride=1, padding=1 } \\

\hline
\multirow{3}*{upsample\_block\_5} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 64 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 64 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 64 } \\
% ~ &  ConvTranspose2d & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 128, kennel:3x3,stride = 2} \\
 

\hline

\multirow{2}*{final conv} &  Normalize & \multicolumn{5}{|c|}{GroupNorm, num\_groups=32, num\_channels=64} \\
~ &  Conv2d & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 6, kernel: 3x3, stride: 1, pad: 1 } \\
\hline

\end{tabular}
}
\label{tab:decoder}
\end{table*}


\begin{table*}[!t]
\caption{Architecture for ControlNet}
\resizebox{1.0\textwidth}{!}{
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
\multicolumn{2}{|c|}{layers} & \multicolumn{5}{|c|}{parameters} \\
\hline
\multirow{4}*{noise\_level\_t} & PositionalEncoding & \multicolumn{5}{|c|}{ t = t * exp(-log(1e4) * arange(64)); encoding = (sin(t), cos(t)) } \\
~ & mlp & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 256} \\
~ & Swish activation & \multicolumn{5}{|c|}{x*sigmoid(x)} \\
~ & mlp & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 64} \\

 \hline
\multirow{1}*{input} & Conv2d & \multicolumn{5}{|c|}{in\_ch:6, out\_ch: 64, kernel: 3x3, stride: 1, pad: 1 } \\
 \hline
 \multirow{1}*{zero\_conv} & Conv2d & \multicolumn{5}{|c|}{in\_ch:6, out\_ch: 64, kernel: 3x3, stride: 1, pad: 1 } \\
 \hline
\multirow{3}*{downsample\_block\_1} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 64 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 64 } \\
~ &  Downsample(Conv2d) & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 64,kernel:3x3, stride:2, padding=((0,1,0,1),val=0) } \\
\hline
 \multirow{1}*{zero\_conv} & Conv2d & \multicolumn{5}{|c|}{in\_ch:6, out\_ch: 64, kernel: 3x3, stride: 1, pad: 1 } \\
 \hline
\multirow{3}*{downsample\_block\_2} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:64, out\_ch: 128 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 128 } \\
~ &  Downsample(Conv2d) & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 128,kernel:3x3, stride:2, padding=((0,1,0,1),val=0) } \\
\hline
 \multirow{1}*{zero\_conv} & Conv2d & \multicolumn{5}{|c|}{in\_ch:6, out\_ch: 64, kernel: 3x3, stride: 1, pad: 1 } \\
 \hline
\multirow{3}*{downsample\_block\_3} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:128, out\_ch: 256 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 256 } \\
~ &  Downsample(Conv2d) & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 256,kernel:3x3, stride:2, padding=((0,1,0,1),val=0) } \\
\hline
 \multirow{1}*{zero\_conv} & Conv2d & \multicolumn{5}{|c|}{in\_ch:6, out\_ch: 64, kernel: 3x3, stride: 1, pad: 1 } \\
 \hline
\multirow{5}*{downsample\_block\_4} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:256, out\_ch: 512 } \\
~ &  SelfAtt & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  SelfAtt & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  Downsample(Conv2d) & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512,kernel:3x3, stride:2, padding=((0,1,0,1),val=0) } \\
\hline
 \multirow{1}*{zero\_conv} & Conv2d & \multicolumn{5}{|c|}{in\_ch:6, out\_ch: 64, kernel: 3x3, stride: 1, pad: 1 } \\
 \hline
\multirow{2}*{downsample\_block\_5} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\

\hline
\multirow{3}*{middle} &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
~ &  AttnBlock & \multicolumn{5}{|c|}{in\_ch:512 } \\
~ &  ResnetBlock & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 512 } \\
\hline
% \multirow{3}*{end} &  Normalize & \multicolumn{5}{|c|}{GroupNorm,num\_groups=32, num\_channels=512} \\

% ~ &  Activation & \multicolumn{5}{|c|}{x*sigmoid(x)} \\

% ~ &  Conv2d & \multicolumn{5}{|c|}{in\_ch:512, out\_ch: 256, kernel: 3x3, stride: 1, pad: 1 } \\
% \hline
\end{tabular}
}
\label{tab:ControlNet}
\end{table*}


