\section{Method}

\begin{figure*}[htb]
    \centering
    \includegraphics[width=\linewidth]{Figure/Figure2_framework.png}
    \caption{\textbf{The network structure of the proposed MagNet.} MagNet utilizes cross-attention layers to integrate features extracted from multi-resolution patches. Additionally, it incorporates a GAT-Transformer block to aggregate neighborhood information while leveraging spatial relationships. The predictions for each resolution level are then independently generated by a regression head.}
    \label{fig:framework}
\end{figure*}

\subsection{Unified Cross-Resolution Feature Aggregation}
We crop patches at the bin, spot, and region levels for each bin $i$, denoted as $i_b$, $i_s$, and $i_r$. The features of these patches, denoted as $f_b$, $f_s$, and $f_r$, are extracted by a pre-trained ResNet50~\cite{he2016deep}. Following TRIPLEX~\cite{chung2024accurate}, we freeze the encoder parameters for the spot and region levels and update only the bin-level encoder to reduce computational overhead.

To refine the representation of $f_b$, we use a cross-attention layer in which $f_b$ acts as the query matrix ($Q$), while the features of the other resolutions, $f_s$ and $f_r$, serve as the key ($K$) and value ($V$) matrices. This merges the information of $f_s$ and $f_r$ into $f_b$. The fused feature $f'_b$ is formulated as:
\begin{equation}
f'_b = \mathrm{softmax} \left( \frac{f_b f_i^{\top}}{\sqrt{d}} \right) f_i, \quad i \in \{s, r\}
\end{equation}
where $\sqrt{d}$ is a scaling factor. Finally, the features from all three levels are concatenated to obtain the fused multi-level feature $F$, which is used in the subsequent stages.
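
As an illustrative reading of the cross-attention above (the shapes are assumptions, not stated in the text), suppose $f_b$ has size $n_b \times d$ and $f_i$ has size $n_i \times d$, where $n_b$ and $n_i$ are hypothetical token counts. The attention map and the fused output then have sizes
\[
\underbrace{\mathrm{softmax}\left( \frac{f_b f_i^{\top}}{\sqrt{d}} \right)}_{n_b \times n_i}
\; \underbrace{f_i}_{n_i \times d}
\;\longrightarrow\; n_b \times d ,
\]
so that, assuming the softmax is taken row-wise, each row of the attention map sums to one and the fused feature keeps the shape of the bin-level query.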

\subsection{Spatial-Guided Graph Integration Block}
To exploit the spatial relationships within pathological images, we propose a spatial-guided graph integration block that combines graph attention network (GAT) and Transformer layers. Connections between bins are first established by calculating a weight $e_{ij}$ between any two nodes $i$ and $j$ using the Euclidean distance. The $k$ smallest $e_{ij}$ values are then selected to define the edges within the whole-slide image. The constructed graph is then fed into the spatial-guided graph integration block for further processing.
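
One possible way to write this graph construction explicitly is sketched below. The coordinate symbol $\mathbf{c}_i$ (the spatial position of bin $i$) is notation introduced here for illustration, and selecting neighbors separately for each node is an assumption about how the $k$ smallest weights are chosen:
\[
e_{ij} = \Vert \mathbf{c}_i - \mathbf{c}_j \Vert_2 , \qquad
\mathcal{N}(i) = \left\{\, j \neq i : e_{ij} \mbox{ is among the } k \mbox{ smallest values in } \{e_{il}\}_{l \neq i} \,\right\} .
\]
Under this reading, each node is linked to its $k$ spatially nearest bins, so bins that are close on the slide exchange information during graph attention.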

After several rounds of graph attention convolution, the processed feature $\mathbf{F}_m^{i}$ at each level, corresponding to $i_b$, $i_s$, and $i_r$, is formulated as:

\begin{equation}
\mathbf{F}_m^{i} = \bigg\Vert_{k=1}^K \sigma \left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^k \mathbf{W}^k \mathbf{f}_m^{j} \right), \quad m \in \{b, s, r\}
\end{equation}
where $\mathcal{N}(i)$ denotes the set of nodes adjacent to node $i$, $\Vert$ denotes the concatenation operation over the $K$ attention heads, $\sigma$ is the activation function, $\alpha_{ij}^k$ is the attention weight between nodes $i$ and $j$ in the $k$-th attention head, and $\mathbf{W}^k$ is a linear transformation matrix determined by the connections between nodes.
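
The text does not specify how $\alpha_{ij}^k$ is computed. If the block follows the standard GAT formulation, which is an assumption here, the coefficients could be obtained as sketched below, where $\mathbf{a}^k$ is a learnable attention vector and $s_{ij}^k$ is an unnormalized score (both symbols are introduced only for this illustration):
\[
s_{ij}^k = \mathrm{LeakyReLU}\left( (\mathbf{a}^k)^{\top} \left[ \mathbf{W}^k \mathbf{f}_m^{i} \,\Vert\, \mathbf{W}^k \mathbf{f}_m^{j} \right] \right), \qquad
\alpha_{ij}^k = \frac{\exp(s_{ij}^k)}{\sum_{l \in \mathcal{N}(i)} \exp(s_{il}^k)} .
\]
With this normalization, the weights $\alpha_{ij}^k$ over the neighbors of node $i$ sum to one for each head.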

A Transformer layer then adaptively aggregates the neighborhood information from each round, thereby enhancing the feature representation. Finally, the regression head generates gene expression predictions for each level separately, denoted as $p_b$, $p_s$, and $p_r$.

\subsection{Loss Function}
To exploit the mutual consistency among multi-level information, we design a hybrid loss function comprising a prediction loss $L_p$ and a consistency loss $L_c$. The prediction loss minimizes the discrepancy between the model's predictions and the ground truth at each resolution level. At the bin level, we employ both the mean squared error (MSE) and a Pearson correlation coefficient (PCC) loss. To avoid introducing additional noise, only the PCC loss is used at the spot and region levels. Hence, the prediction loss is formulated as:
\begin{equation}
L_p = \mathrm{MSE}(p_b, y_b) + \sum_{i \in \{b,s,r\}} \lambda_i \cdot \mathrm{PCC}(p_i, y_i)
\end{equation}
Here, $b$, $s$, and $r$ represent the bin, spot, and region levels, respectively; $p_i$ and $y_i$ denote the model's prediction and the corresponding ground truth at level $i$; and $\lambda_i$ is a hyperparameter that balances the PCC loss across resolution levels.
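
The exact form of the PCC loss is not given above. A common choice, assumed here for illustration, is one minus the Pearson correlation between a prediction vector $p$ and a target vector $y$ with $G$ entries (the symbol $G$ and the entry index $g$ are introduced for this sketch):
\[
\mathrm{PCC}(p, y) = 1 - \frac{\sum_{g=1}^{G} (p_g - \bar{p})(y_g - \bar{y})}{\sqrt{\sum_{g=1}^{G} (p_g - \bar{p})^2}\, \sqrt{\sum_{g=1}^{G} (y_g - \bar{y})^2}} ,
\]
where $\bar{p}$ and $\bar{y}$ are the means of $p$ and $y$. Under this convention, the loss is $0$ for perfectly correlated vectors and $2$ for perfectly anti-correlated ones, so minimizing it encourages predictions that follow the trend of the target.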

Since patches at different resolutions within the same region exhibit similar trends in gene expression, we employ PCC loss to constrain the differences between bin-level predictions and those at other levels. The consistency loss $L_c$ is defined as:
\begin{equation}
L_c = \lambda_1 \cdot \mathrm{PCC}(p_b, p_s) + \lambda_2 \cdot \mathrm{PCC}(p_b, p_r)
\end{equation}
Here, $\lambda_1$ and $\lambda_2$ weight the two consistency terms. The overall loss of the model $L$ is then defined as:
\begin{equation} 
L = \gamma_1 \cdot L_p + \gamma_2 \cdot L_c 
\end{equation}
Here, $\gamma_1$ and $\gamma_2$ are hyperparameters that balance the two losses; they are set to 1 and 0.25, respectively, in the subsequent experiments.
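
For reference, substituting these values together with the definitions of $L_p$ and $L_c$ gives the objective written out in full:
\[
L = \mathrm{MSE}(p_b, y_b) + \sum_{i \in \{b,s,r\}} \lambda_i \, \mathrm{PCC}(p_i, y_i) + 0.25 \left[ \lambda_1 \, \mathrm{PCC}(p_b, p_s) + \lambda_2 \, \mathrm{PCC}(p_b, p_r) \right] ,
\]
which makes explicit that the consistency terms enter at one quarter of the weight of the prediction terms before the $\lambda$ factors are applied.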
