

\section{Data and Experiments}
\noindent \textbf{Datasets and Preprocessing.}
We evaluated all methods on three public ST datasets: HER2~\citep{andersson2021spatial}, Breast Cancer~\citep{he2020integrating}, and Kidney~\citep{lake2023atlas}. For each spot, we cropped a $224 \times 224$ pixel patch centered on the spot's spatial coordinates as model input. We selected the top 300 genes with the highest average variance as prediction targets. Following BLEEP~\citep{xie2024spatially}, we applied a $\log(1+x)$ transformation to the raw readouts. For the external single-cell references, we used two million cells from~\cite{lake2025cellular} for the Kidney dataset, and around three million cells from~\cite{chen2025highly, reed2024single, klughammer2024multi} for the Breast Cancer and HER2 datasets. A detailed profile of the datasets is provided in Appendix~\ref{sec:sup_data}.
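
\noindent For concreteness, the target construction can be sketched as follows. The notation is introduced here for illustration only: $x_{sig}$ denotes the raw readout of gene $g$ at spot $i$ of sample $s$, $N_s$ the number of spots in sample $s$, and $S$ the number of samples. This sketch assumes that the variance is computed per sample on raw readouts and then averaged across samples; other readings of ``average variance'' are possible.
\[
\bar{v}_g = \frac{1}{S}\sum_{s=1}^{S} \frac{1}{N_s}\sum_{i=1}^{N_s}\left(x_{sig} - \bar{x}_{sg}\right)^2,
\qquad
\bar{x}_{sg} = \frac{1}{N_s}\sum_{i=1}^{N_s} x_{sig}.
\]
Under this assumption, the 300 genes with the largest $\bar{v}_g$ form the target set $\mathcal{G}$, and the prediction target for each selected gene is
\[
y_{sig} = \log\left(1 + x_{sig}\right), \qquad g \in \mathcal{G}.
\]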

% The HER2 dataset comprises 8 tissue samples with 36 WSIs and a total of 13,620 spots. The Breast Cancer dataset contains 23 samples with 68 WSIs and 30,066 spots. The Kidney dataset includes 22 samples with 23 WSIs and 25,944 spots. The spot diameter is 100 $\mu$m for both HER2 and Breast Cancer datasets, while 55 $\mu$m for the Kidney dataset. 

% We selected the top 300 genes with the highest average expression variance as prediction targets. Following BLEEP~\citep{xie2024spatially}, we applied a $\log(1+x)$ transformation directly to the raw count matrices to address the long-tailed distribution of gene expression data~\cite{he2020integrating}. The selected genes for each dataset are illustrated in Appendix Figure xxx.


\noindent \textbf{Baselines.}
We compared our model against state-of-the-art (SOTA) methods, including the regression-based models ST-Net~\cite{he2020integrating}, EGN~\cite{yang2023exemplar}, HisToGene~\cite{pang2021leveraging}, Hist2ST~\cite{zeng2022spatial}, and TRIPLEX~\cite{chung2024accurate}, and the retrieval-based models BLEEP~\cite{xie2024spatially} and mclSTExp~\cite{min2024multimodal}. All methods were trained and evaluated under the same experimental settings to ensure a fair comparison. To validate our sampling strategy,
we compared it against Monte Carlo random sampling, uncertainty-based sampling~\cite{safaei2024entropic}, and diversity-driven sampling~\cite{zhdanov2019diverse}.


\noindent \textbf{Evaluation Metrics.} We employed the Pearson correlation coefficient (PCC), mean squared error (MSE), and mean absolute error (MAE) to assess gene expression prediction from both spatial correlation and error perspectives.
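
\noindent For reference, the standard definitions of these metrics are given below. The notation is illustrative: $y_{ig}$ and $\hat{y}_{ig}$ denote the ground-truth and predicted expression of gene $g$ at spot $i$, with $N$ test spots and $G$ target genes. We assume here that PCC is computed per gene across spots and then averaged over genes, and that MSE and MAE are averaged over all spot-gene pairs; the exact aggregation may differ.
\[
\mathrm{PCC}_g = \frac{\sum_{i=1}^{N}\left(y_{ig} - \bar{y}_g\right)\left(\hat{y}_{ig} - \bar{\hat{y}}_g\right)}{\sqrt{\sum_{i=1}^{N}\left(y_{ig} - \bar{y}_g\right)^2}\,\sqrt{\sum_{i=1}^{N}\left(\hat{y}_{ig} - \bar{\hat{y}}_g\right)^2}},
\qquad
\mathrm{PCC} = \frac{1}{G}\sum_{g=1}^{G}\mathrm{PCC}_g,
\]
\[
\mathrm{MSE} = \frac{1}{NG}\sum_{i=1}^{N}\sum_{g=1}^{G}\left(y_{ig} - \hat{y}_{ig}\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{NG}\sum_{i=1}^{N}\sum_{g=1}^{G}\left|y_{ig} - \hat{y}_{ig}\right|,
\]
where $\bar{y}_g$ and $\bar{\hat{y}}_g$ are the means of $y_{ig}$ and $\hat{y}_{ig}$ over the $N$ spots. Higher PCC and lower MSE and MAE indicate better performance.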

\noindent \textbf{Implementation Details.} All experiments were conducted on a single NVIDIA RTX A6000 GPU. We employed the SGD optimizer with a momentum of 0.9 and a weight decay of $10^{-4}$. The initial learning rate was set to $\mathrm{lr}_0 = 10^{-4}$ and decayed to $10^{-6}$ with a cosine annealing schedule. The training batch size was set to 256. Further implementation details and hyperparameter settings are listed in Appendix~\ref{sec:sup_implementation}.
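
\noindent As an illustration, the commonly used form of cosine annealing without warm restarts is shown below. The symbols $\eta_t$, $\eta_0$, $\eta_{\min}$, and $T$ are introduced here for exposition; we assume $T$ is the total number of scheduler steps, which is not specified in this section.
\[
\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_0 - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right),
\qquad t = 0, 1, \dots, T,
\]
with $\eta_0 = 10^{-4}$ the initial learning rate and $\eta_{\min} = 10^{-6}$ the final one. Under this form, $\eta_0 = 10^{-4}$ at $t = 0$, $\eta_{T/2} = 5.05 \times 10^{-5}$ at the midpoint, and $\eta_T = 10^{-6}$ at the end of training.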
