\section{Method}
\subsection{Overall Framework}
We introduce SCR$^2$-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction under limited data budgets, as illustrated in Figure~\ref{fig:framework}. Within SCR$^2$-ST, our single-cell-guided reinforcement learning-based (SCRL) active sampling strategy integrates single-cell priors with spatial features and uses a multi-objective reward function to iteratively drive the policy toward selecting the most informative spots. To fully exploit the abundant single-cell priors during prediction, we design a hybrid regression-retrieval prediction network, SCR$^2$Net, that fuses direct regression with retrieval-augmented soft-label supervision on the actively sampled set.

\begin{figure*}[htbp]
    \centering
    \includegraphics[width=\textwidth]{Figure_pdf/Figure2_framework.pdf}
    \caption{\textbf{Overview of our proposed SCR$^2$-ST framework.} \textit{Left}: The single-cell-guided reinforcement learning-based (SCRL) active sampling strategy integrates vision features and external single-cell priors to iteratively select informative spots, ensuring efficient data acquisition under limited data budgets. \textit{Right}: SCR$^2$Net infers gene expression from histology images with a retrieval-augmented reference. Majority cell-type filtering guided by single-cell priors is applied to suppress unreliable matches in heterogeneous regions.}
    \label{fig:framework}
\end{figure*}
\subsection{Active Sampling via Single-cell Guided Reinforcement Learning}
\subsubsection{Policy Network for Active Sampling}

Prior to active sampling, we perform dense visual feature extraction on tissue sections. Specifically, we uniformly partition the whole-slide images (WSIs) into patches and employ the pre-trained UNI~\cite{chen2024uni} encoder to extract a visual embedding for each patch, yielding $\{e_i\}_{i=1}^{N}$ along with the corresponding spatial coordinates $\{(x_i, y_i, w_i)\}_{i=1}^{N}$, where $w_i$ denotes the slide identifier. Based on these embeddings, we construct a lightweight policy network $\pi_\theta(\cdot)$ that outputs a sampling priority score for each candidate location as
\begin{equation}
\pi_{\theta}(e_i) = W_2 \cdot \mathrm{ReLU}(W_1 e_i),
\end{equation}
where $W_1 \in \mathbb{R}^{128 \times d}$ and $W_2 \in \mathbb{R}^{1 \times 128}$ are learnable parameters and $d$ is the dimension of the visual embedding. The scores are then normalized into a probability distribution via softmax:
\[
p_i = \frac{\exp(\pi_{\theta}(e_i))}{\sum_{j=1}^N \exp(\pi_{\theta}(e_j))}.
\]
At iteration $t$, the policy network samples $k$ new locations $S_t$ from the unsampled candidate set $\mathcal{U}_t$ according to this probability distribution and adds them to the sampling pool $\mathcal{S} = \bigcup_{\tau \leq t} S_\tau$. The sampling process terminates when the number of samples reaches the total budget $B$.
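
As a small illustration of the normalization and sampling step, consider $N = 3$ candidates with hypothetical scores $\pi_\theta(e_1) = 2$, $\pi_\theta(e_2) = 1$, and $\pi_\theta(e_3) = 0$. The softmax gives
\[
p_1 = \frac{e^{2}}{e^{2} + e^{1} + e^{0}} \approx 0.665, \quad
p_2 \approx 0.245, \quad
p_3 \approx 0.090.
\]
How the distribution is restricted to $\mathcal{U}_t$ is not detailed above. One plausible realization, assumed here purely for illustration, is to renormalize over the unsampled set. If location $1$ has already been sampled, so that $\mathcal{U}_t = \{2, 3\}$, this yields
\[
\tilde{p}_i = \frac{\exp(\pi_\theta(e_i))}{\sum_{j \in \mathcal{U}_t} \exp(\pi_\theta(e_j))}, \qquad
\tilde{p}_2 = \frac{e^{1}}{e^{1} + e^{0}} \approx 0.731, \quad
\tilde{p}_3 \approx 0.269.
\]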

\subsubsection{Multi-Objective Reward Design}

After obtaining the sample set $S_t$ at round $t$, we evaluate sampling quality and construct multi-objective reward signals to update the policy network. We extract ST expression embeddings $\{\mathbf{z}_i\}_{i \in S_t}$ for the sampled locations with the pretrained scGPT~\cite{cui2023scGPT}, together with reference embeddings $\{\mathbf{q}_j\}_{j=1}^{M}$ from external single-cell data. The reward function measures sampling quality from two complementary perspectives: biological diversity and spatial uniformity.

\noindent \textbf{Single-Cell Prior-Guided Biological Diversity Reward.}
To quantify how well the sample set explores the single-cell state space, we first apply PCA to reduce the single-cell embeddings $\{\mathbf{q}_j\}$ to 50 dimensions, then cluster them into $C$ latent cell state clusters using MiniBatchKMeans, obtaining the cluster center set $\{\boldsymbol{\mu}_c\}_{c=1}^{C}$. The coverage reward measures the fraction of clusters reached by the sample set:
\begin{equation}
R_{\mathrm{sc}}(S_t) = \frac{\left| \left\{ \arg\min_{c} \|\mathbf{z}_i - \boldsymbol{\mu}_c\|_2 : i \in S_t \right\} \right|}{C},
\end{equation}
where $|\cdot|$ denotes set cardinality. Each sampled point $\mathbf{z}_i$ is assigned to its nearest cluster, and coverage is computed as the ratio of unique clusters covered to total clusters $C$.
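
For example, with hypothetical values $C = 10$ and five sampled spots whose nearest cluster indices are $3, 3, 7, 1, 7$, the set of covered clusters is $\{1, 3, 7\}$, so
\[
R_{\mathrm{sc}}(S_t) = \frac{|\{1, 3, 7\}|}{10} = 0.3.
\]
Repeated assignments to an already covered cluster leave this reward unchanged; it increases only when a sampled spot reaches a new cluster.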

To evaluate the cell type diversity of the selected samples, we match each ST embedding $\mathbf{z}_i$ to its most similar single-cell embedding by cosine similarity and retrieve the corresponding cell type label:
\begin{equation}
j^*(i) = \arg\max_{j} \frac{\mathbf{z}_i^\top \mathbf{q}_j}{\|\mathbf{z}_i\| \|\mathbf{q}_j\|}.
\end{equation}

We compute the cell type distribution in the sample set as $P(k) = |\{i \in S_t : \mathrm{type}(j^*(i)) = k\}| / |S_t|$, and define the diversity reward based on normalized entropy:
\begin{equation}
R_{\mathrm{type}}(S_t) = \frac{-\sum_{k} P(k) \log(P(k) + \varepsilon)}{\log(K + \varepsilon)},
\end{equation}
where $k$ indexes individual cell types, $K$ is the number of distinct cell types observed in the sample set, and $\varepsilon$ is a small constant for numerical stability. This reward encourages preferential sampling of regions with greater cellular heterogeneity.
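
As a worked example with hypothetical numbers, suppose $K = 2$ cell types are observed and $\varepsilon$ is negligible. A balanced sample with $P = (0.5, 0.5)$ gives
\[
R_{\mathrm{type}}(S_t) = \frac{-(0.5 \log 0.5 + 0.5 \log 0.5)}{\log 2} = 1,
\]
whereas a skewed sample with $P = (0.75, 0.25)$ gives
\[
R_{\mathrm{type}}(S_t) = \frac{-(0.75 \log 0.75 + 0.25 \log 0.25)}{\log 2} \approx \frac{0.562}{0.693} \approx 0.811.
\]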


\noindent \textbf{Spatial Distribution Diversity Reward.} The spatial distribution of the sampled points also affects information density. An ideal sampling strategy should balance two objectives: (1) spatial dispersion, to avoid over-clustering in local regions; and (2) uniform coverage, to ensure that unsampled locations have nearby reference points. We therefore define the dispersion $D_{\mathrm{disp}}$ as the average pairwise distance among sampled points, where larger values indicate better dispersion, and the coverage $D_{\mathrm{cover}}$ as the average distance from all candidate locations to their nearest sampled point, where smaller values indicate more uniform coverage:
\begin{equation}
D_{\mathrm{disp}}(S_t) = \frac{1}{|S_t|^2} \sum_{i,j \in S_t} \|(x_i, y_i) - (x_j, y_j)\|_2, \quad
D_{\mathrm{cover}}(S_t) = \frac{1}{N} \sum_{i=1}^{N} \min_{j \in S_t} \|(x_i, y_i) - (x_j, y_j)\|_2.
\end{equation}
The spatial distribution diversity reward combines both metrics, rewarding large dispersion and penalizing large coverage distance:
\begin{equation}
R_{\mathrm{spa}}(S_t) = \frac{D_{\mathrm{disp}}(S_t) - D_{\mathrm{cover}}(S_t)}{2}.
\end{equation}
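
To illustrate the two distance terms, consider a hypothetical one-dimensional layout with $N = 5$ candidates at $(x, y) \in \{(0,0), (1,0), (2,0), (3,0), (4,0)\}$. Sampling the two end points $x = 0$ and $x = 4$ gives
\[
D_{\mathrm{disp}} = \frac{0 + 4 + 4 + 0}{2^2} = 2, \qquad
D_{\mathrm{cover}} = \frac{0 + 1 + 2 + 1 + 0}{5} = 0.8,
\]
whereas sampling the interior points $x = 1$ and $x = 3$ gives
\[
D_{\mathrm{disp}} = \frac{0 + 2 + 2 + 0}{2^2} = 1, \qquad
D_{\mathrm{cover}} = \frac{1 + 0 + 1 + 0 + 1}{5} = 0.6.
\]
The first choice is more dispersed, while the second leaves every candidate closer to a sampled point, so the two terms capture different aspects of the spatial layout.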

\subsubsection{Combined Reward and Policy Optimization}

We linearly combine the three reward components into a composite signal:
\begin{equation}
R(S_t) = w_{\mathrm{sc}} \cdot R_{\mathrm{sc}}(S_t) + w_{\mathrm{type}} \cdot R_{\mathrm{type}}(S_t) + w_{\mathrm{spa}} \cdot R_{\mathrm{spa}}(S_t),
\end{equation}
where $w_{\mathrm{sc}}$, $w_{\mathrm{type}}$, and $w_{\mathrm{spa}}$ control the relative contributions of single-cell manifold coverage, cell type diversity, and spatial distribution diversity, respectively. We then update the policy network parameters using the composite reward:
\begin{equation}
\nabla_\theta \mathcal{J} = \mathbb{E}_{S_t \sim \pi_\theta} \left[ R(S_t) \cdot \nabla_\theta \log \pi_\theta(S_t) \right],
\end{equation}
where $\mathcal{J}$ is the expected cumulative reward. Through gradient ascent optimization, the policy network progressively learns to balance biological diversity and spatial uniformity, steering the sampling strategy toward more informative tissue regions.
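
The set-level log-probability $\log \pi_\theta(S_t)$ can be made concrete under a simplifying assumption that is not stated above: the $k$ draws in $S_t$ are treated as independent draws from the softmax distribution $p$. In that case,
\[
\log \pi_\theta(S_t) \approx \sum_{i \in S_t} \log p_i, \qquad
\nabla_\theta \log p_i = \nabla_\theta \pi_\theta(e_i) - \sum_{j=1}^{N} p_j \, \nabla_\theta \pi_\theta(e_j),
\]
where $\pi_\theta(e_i)$ is the score of location $i$. A single-sample estimate of the gradient, with a hypothetical learning rate $\eta$, then leads to the update
\[
\theta \leftarrow \theta + \eta \, R(S_t) \sum_{i \in S_t} \nabla_\theta \log p_i.
\]
This is a sketch of a standard REINFORCE-style step; the actual estimator may differ, for example through a baseline or sampling without replacement.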

\subsection{SCR$^2$Net: Single-Cell Guided Regression-Retrieval Network}

To further leverage single-cell prior knowledge, we design SCR$^2$Net with two complementary paths: a direct regression path for image-to-expression mapping and a retrieval-augmented path that provides soft supervision by retrieving similar samples from the training set, which serves as an external knowledge base.

\subsubsection{Single-Cell Guided Retrieval Module}

\noindent \textbf{Cross-Modality Alignment.} Direct regression alone struggles to capture complex expression patterns under limited training samples. To address this, we introduce a retrieval-augmented module that treats the training set as an external memory bank encoding single-cell knowledge, providing soft supervision for the regression pathway.

We design two projection heads with identical architecture to map the image features $f_{\mathrm{img}}$ from the visual encoder and the gene expression embeddings into a shared semantic space. An InfoNCE loss $\mathcal{L}_{\mathrm{con}}$ is applied to align the vision and omics representations and to update the projection heads. We then compute the cosine similarity between the query image and each reference sample $j$:
\begin{equation}
\mathrm{sim}(f_{\mathrm{img}}, y_j) = \frac{\phi_{\mathrm{img}}(f_{\mathrm{img}})^\top \phi_{\mathrm{expr}}(y_j)}{\|\phi_{\mathrm{img}}(f_{\mathrm{img}})\| \|\phi_{\mathrm{expr}}(y_j)\|},
\end{equation}
where $\phi_{\mathrm{img}}$ and $\phi_{\mathrm{expr}}$ denote the image and expression projection heads, respectively. We select the top-$K$ most similar samples to construct the reference set.
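
The exact form of the contrastive loss is not spelled out above. A common image-to-expression InfoNCE formulation, given here only as a sketch, assumes a mini-batch of $n$ paired samples $(f_i, y_i)$, where $f_i$ is the image feature of the $i$-th sample, and a temperature $\tau$; both $n$ and $\tau$ are introduced for illustration:
\[
\mathcal{L}_{\mathrm{con}} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\exp\left(\mathrm{sim}(f_i, y_i) / \tau\right)}{\sum_{j=1}^{n} \exp\left(\mathrm{sim}(f_i, y_j) / \tau\right)}.
\]
Each matched image-expression pair is pulled together while the other $n - 1$ expression profiles in the batch act as negatives. Symmetric variants add the analogous expression-to-image term.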

\noindent \textbf{Cell-Type-Aware Filtering and Knowledge Distillation.} Expression patterns vary significantly across cell types in ST data, whereas their visual representations can look similar; directly aggregating all retrieved samples may therefore introduce noise. To ensure biological consistency, we introduce a majority cell-type filtering mechanism: we compute the cell type distribution among the top-$K$ samples and retain only those belonging to the $T$ most frequent cell types. The mean expression of the filtered samples serves as the retrieved soft label
\[
\hat{y}_{\mathrm{ret}} = \frac{1}{|\mathcal{R}|} \sum_{j \in \mathcal{R}} y_j,
\]
where $\mathcal{R}$ denotes the filtered retrieval set.
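
For instance, with hypothetical values $K = 5$ and $T = 1$, suppose the retrieved samples $1, \dots, 5$ have cell types $(\mathrm{A}, \mathrm{A}, \mathrm{B}, \mathrm{A}, \mathrm{C})$. Only the three type-A samples are retained, so $\mathcal{R} = \{1, 2, 4\}$ and
\[
\hat{y}_{\mathrm{ret}} = \frac{y_1 + y_2 + y_4}{3}.
\]
With $T = 2$, one of the equally frequent types B and C would also be kept; how such ties are broken is not specified here.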

To account for retrieval quality, we introduce a similarity-based confidence mask $m$, such that higher retrieval similarity places greater weight on the distillation loss. The retrieved soft label $\hat{y}_{\mathrm{ret}}$ then guides the regression path through knowledge distillation. The distillation loss $\mathcal{L}_{\mathrm{ret}}$, weighted by a hyperparameter $\lambda_{\mathrm{KD}}$, is defined as
\begin{equation}
\mathcal{L}_{\mathrm{ret}} = \lambda_{\mathrm{KD}} \cdot m \cdot \|\hat{y} - \hat{y}_{\mathrm{ret}}\|^2,
\end{equation}
where $\hat{y}$ is the prediction of the regression path.


\subsubsection{Regression Path and Training Objective}
We adopt DenseNet-121~\cite{huang2017densely} pre-trained on ImageNet as the visual encoder to capture histomorphological patterns. Given an input patch, the encoder produces a compact feature vector through global average pooling. A two-layer MLP then decodes the visual features into gene expression predictions $\hat{y}$, forming the direct regression path.

To supervise the regression prediction, we employ two complementary losses. The MSE loss $\mathcal{L}_{\mathrm{reg}} = \|y - \hat{y}\|^2$ directly minimizes the difference between predictions and ground truth, while a Pearson correlation coefficient (PCC) loss $\mathcal{L}_{\mathrm{pcc}} = 1 - \mathrm{PCC}(y, \hat{y})$ captures the correlation structure across genes, which is important for preserving gene-spatial relationships. The total loss combines direct supervision from both regression losses, weighted by hyperparameters $\lambda_r$ and $\lambda_p$, with soft supervision from the retrieval-based distillation:
\begin{equation}
\mathcal{L} = \lambda_r \cdot \mathcal{L}_{\mathrm{reg}} + \lambda_p \cdot \mathcal{L}_{\mathrm{pcc}} + \mathcal{L}_{\mathrm{ret}}.
\end{equation}
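
For reference, the correlation term can be written out with the standard Pearson formula. Assuming, as one reading of the description above, that $y, \hat{y} \in \mathbb{R}^{G}$ are the measured and predicted expression vectors of a single spot over $G$ genes (the symbol $G$ is introduced here for illustration),
\[
\mathrm{PCC}(y, \hat{y}) = \frac{\sum_{g=1}^{G} (y_g - \bar{y})(\hat{y}_g - \bar{\hat{y}})}{\sqrt{\sum_{g=1}^{G} (y_g - \bar{y})^2} \, \sqrt{\sum_{g=1}^{G} (\hat{y}_g - \bar{\hat{y}})^2}},
\]
where $\bar{y}$ and $\bar{\hat{y}}$ are the means of $y$ and $\hat{y}$ over genes. Since $\mathrm{PCC} \in [-1, 1]$, the loss $\mathcal{L}_{\mathrm{pcc}}$ lies in $[0, 2]$ and vanishes only when the prediction is perfectly positively correlated with the ground truth.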
