Title: GeSubNet: Gene Interaction Inference for Disease Subtype Network Generation

Abstract: Understanding gene functional networks is fundamental to biomedical research, relying heavily on both comprehensive biological knowledge bases like STRING (Szklarczyk et al., 2023) and KEGG (Kanehisa et al., 2024), and rich experimental data such as patient gene expression profiles. A critical challenge arises from the inherent generalization of these knowledge bases, which often lack the specificity required to capture variations across distinct disease subtypes. Current methods struggle to effectively integrate gene interaction knowledge with subtype-specific variations, leading to misinterpretations of gene behaviors across different disease contexts.

To bridge this critical gap, we introduce GeSubNet, a novel multi-step representation learning framework. GeSubNet learns a unified representation that accurately predicts gene interactions while explicitly distinguishing between different disease subtypes, thereby generating highly targeted subtype-specific networks. It achieves this through three integrated modules: a deep generative model for patient subtyping, a graph neural network for learning prior gene interactions, and a novel inference mechanism that unifies these representations to generate refined, subtype-specific networks.

GeSubNet consistently outperforms traditional and deep learning methods, demonstrating average improvements of 30.6%, 21.0%, 20.1%, and 56.6% across four graph evaluation metrics, averaged over four diverse cancer datasets. Furthermore, through a biological simulation experiment involving over 11,000 gene candidates, we show that the generated networks have the potential to identify subtype-specific genes with an 83% likelihood of impacting patient distribution shifts. This work significantly advances the generation of biologically meaningful, subtype-specific gene networks, offering new avenues for precision medicine.

Section: INTRODUCTION
Understanding disease-gene associations is fundamental to biomedical research, relying heavily on both comprehensive biological knowledge bases like STRING (Szklarczyk et al., 2023) and KEGG (Kanehisa et al., 2024), and rich experimental data such as patient gene expression profiles. A critical challenge arises from the inherent generalization of these knowledge bases, which often lack the specificity required to capture variations across distinct disease subtypes. This paper addresses this gap by introducing a novel deep learning approach that effectively integrates generalized knowledge with subtype-specific experimental data to construct highly targeted and biologically meaningful knowledge graphs.
Decades of research have generated extensive disease-gene association data, compiled into various biological knowledge databases (Goh et al., 2007b;Szklarczyk et al., 2023;Kanehisa & Goto, 2000). These databases integrate known and predicted gene interactions, forming gene functional networks that describe how gene behaviors relate to disease processes. They support disease research by interpreting experimental results (Vella et al., 2017), facilitating biomarker discovery (Yang et al., 2023), and enabling personalized treatment (Goossens et al., 2015). Alongside these general knowledge bases, in-lab experimental data, such as patient gene expression profiles, offer crucial insights by filtering candidate genes whose interactions are more relevant to specific disease subtypes. However, a significant mismatch persists between the broad scope of generic knowledge bases and the granular detail of experimental data when studying disease subtypes. For instance, as illustrated in Figure 1, breast cancer encompasses multiple subtypes (e.g., Luminal A, Luminal B, and Basal-like), yet databases like STRING typically provide only a general gene network applicable to all subtypes. This generalization can lead to misinterpretations of gene behaviors and hinder the development of targeted therapies.
While bio-researchers have proposed data generation approaches to construct meaningful subtype-specific networks (Zaman et al., 2013), these often necessitate extensive in-lab analyses, such as laborious pair-wise gene examinations among hundreds to thousands of gene candidates. This paper introduces GeSubNet, a novel data-driven approach designed to automate the integration of gene expression data and knowledge databases, directly generating gene functional networks tailored for various disease subtypes. This automation significantly reduces the reliance on manual, labor-intensive experimental validation.
Related Work. Existing methods for generating subtype gene networks can be broadly categorized into two main groups: statistical and deep learning-based methods. Statistical methods primarily focus on accelerating gene filtering by mining experimental data. These approaches typically employ similarity metrics to quantify correlations between genes. High correlations, such as those observed in co-expressed genes (Zhang & Horvath, 2005), are often interpreted as functional interactions. For example, ARACNe (Margolin et al., 2006) utilizes mutual information to measure expression similarity and subsequently removes indirect links with low similarity. WGCNA (Langfelder & Horvath, 2008) calculates Pearson correlation to facilitate large-scale comparisons, while wTO (Gysi et al., 2018) transforms these correlations into probabilistic measures. Despite their utility, these statistical methods often prioritize genes of interest and may not fully capture the complex, multifaceted nature of gene interactions in a subtype-specific manner.
A growing number of deep learning methods leverage both knowledge databases and experimental datasets. These methods often represent disease networks as graphs and embed gene expression data, which contains diverse patient information, as node embeddings. They typically employ graph neural networks (GNNs) for tasks such as link prediction and graph reconstruction, where the newly reconstructed graphs are intended to represent specific disease networks. Representative methods include GAERF (Wu et al., 2021a), which learns node features using a graph auto-encoder and then employs a random forest for link prediction. CSGNN (Zhao et al., 2021) predicts gene interactions by combining a mix-hop aggregator with a self-supervised GNN. LR-GNN (Kang et al., 2022) proposes a dynamic graph method to gradually reconstruct graph structure, aiming to mitigate the constraints imposed by prior general disease network information. Recent works have also focused on improving the accuracy of gene-gene link prediction (Li et al., 2024;Pang et al., 2024). However, a common limitation of these approaches is their primary objective: to reconstruct general disease-gene associations, which often includes irrelevant interactions. Consequently, these methods do not explicitly learn or highlight the distinct gene interactions unique to specific disease subtypes, which is crucial for precision medicine.
Contributions and Novelty. We present GeSubNet, a novel solution specifically designed to leverage distinct subtype information from experimental data, i.e., gene expression profiles, to directly infer gene interactions specific to disease subtype networks. GeSubNet learns a unified representation that can accurately predict prior gene interactions while simultaneously distinguishing between different subtypes of a disease. The graphs generated by such representations are therefore truly subtype-specific networks.
GeSubNet operates as a multi-step learning framework, featuring independent data representation learning and a sophisticated integration mechanism. The first step involves a deep generative model to learn gene expression representations that capture distinct data distributions and effectively distinguish subtypes within a latent feature space. The second step employs a GNN to learn robust graph representations of prior gene networks, ensuring GeSubNet captures biologically accurate gene-gene functional interactions documented in knowledge databases. Finally, we integrate these two representations through a novel inference module, which updates graph representations and infers subtype-specific gene interactions using a reconstruction loss conditioned on the gene expression data.
Our extensive experiments confirm that GeSubNet can simultaneously generate highly differentiated subtype networks within a general cancer context. The key contributions of this paper are:
• Formulating a Novel Problem. We formally frame the problem of inferring gene interactions in a way that directly helps models distinguish subtypes in experimental datasets. This work introduces an automated method for integrating gene expression data and knowledge databases, explicitly generating disease subtype networks that are tailored to specific patient groups.
• Proposing an Automated Data Integration Methodology. GeSubNet offers an effective and innovative architecture that combines a Vector Quantized-Variational AutoEncoder (VQ-VAE) and Neo-GNN. This integration achieves significant average improvements of 30.6%, 21.0%, 20.1%, and 56.6% across four key metrics on four diverse cancer datasets. Furthermore, the modular design of GeSubNet allows for easy integration of more advanced models in the future.
• Demonstrating Broad Biological Relevance and Novel Evaluation. We propose impactful biological evaluations, including a new metric, to rigorously assess the generated networks. Experiments involving 11,327 gene evaluations robustly demonstrate that genes selected by GeSubNet are highly related to specific subtypes. We are the first to conduct a simulated experiment, termed "Knock-out" (Bergman & Siegal, 2003), to assess how the behavior of selected genes affects different subtypes. The proposed Shift Rate (∆ SR ) metric effectively evaluates the reliability of selected gene interactions, showing that GeSubNet significantly narrows down key genes with high biological significance.
• Integrated and Publicly Available Datasets for Cancer Subtyping. We have meticulously collected physical cancer-gene networks across four comprehensive knowledge databases and constructed machine-learning-ready datasets for both experimental validation and future research. We are releasing our datasets with this paper to support continued investigation and foster advancements in cancer subtyping. The code and data resources are publicly available at: https://github.com/chenzRG/GeSubNet

Section: PRELIMINARY AND PROBLEM SETTING
2.1 BACKGROUND: CANCER SUBTYPE
Cancer is a major public health concern, with increasing incidence and mortality. The National Cancer Institute (NCI) reports that the cost of cancer care is projected to grow to $246.6 billion by 2030 (COS, 2023). A key driver of these high costs and morbidity is cancer's inherent heterogeneity. Each cancer type is made up of multiple subtypes, characterized by distinct biochemical mechanisms and requiring specific therapeutic approaches (Balmain et al., 2003). While these subtypes may differ biochemically, they often share similar morphological traits, such as the physical structure and form of the organism (Yang et al., 2023), which complicates precise diagnosis and treatment. This complexity highlights the need for deeper research into gene networks specific to cancer subtypes. However, as shown in Figure 1, current knowledge bases like STRING provide only broad cancer gene networks without distinguishing between subtypes. This lack of specificity makes it difficult to target treatments based on unique subtype characteristics. Our paper addresses this problem by advancing research and tools that differentiate these subtypes at a more granular level.

Section: PROBLEM SETTING
Definition 1 (Gene expression data). The fundamental entity in gene expression profile data is the individual patient. Each patient profile comprises tens of thousands of genes with measured features. Let $X = \{x^{(m)}\}_{m=1}^{M}$ denote a dataset of $M$ patients. Each patient can be represented as a sequence of $N$ gene measures, $x^{(m)} = \{x^{(m)}_1, x^{(m)}_2, \dots, x^{(m)}_N\}$. Let $Y = \{y_1, y_2, \dots, y_{|Y|}\}$ denote the set of subtypes for a cancer. Each $x^{(m)}$ is associated with a label $y$.
[Figure caption, partially recovered] Step 2: Graph-M sets up a link prediction task to train the GNN encoder and decoder, learning the graph representation ($Z_g$) from the input gene graph ($G$) and expression data ($X$). Step 3: Infer-M uses an objective function that integrates representations to generate subtype-specific networks. The reconstruction from Patient-M, conditioned on the GNN training in Graph-M ($q_\theta(z_g|G)$), refines the graph structure, while ensuring accurate patient profile reconstruction ($p_\phi(x|\tilde{x})$).
Definition 2 (Knowledge gene networks). A gene network, as compiled in knowledge databases, can be represented as a general graph $G = (V, E)$ shared across all $M$ patients, where $V$ denotes the set of vertices, corresponding to the genes, and $E$ is the set of edges representing the gene interactions (links). A link can be represented as $e_{ij} = (v_i, v_j)$, where $i, j \in \{1, \dots, N\}$.
Problem (Subtype-specific gene network inference). Given a general disease-gene network $G$, we assume that it can be decomposed into a set of sub-graphs $G_y = \{G_1, G_2, \dots, G_{|Y|}\}$, corresponding to the subtypes in $Y$. The links, as defined in knowledge databases, are directly transformed into a set of edges, $\{0, 1\}^{N} \xrightarrow{\text{K.I.}} e_{ij} \in [0, 1]^{N}$, where K.I. denotes knowledge-based initialization for graph construction. We aim to integrate the gene expression profile $X$ to identify specific link sets relevant to a given subtype, formalized as $F(\cdot): e_{ij} \to \{0, 1\}^{(y)}$. Notably, these sub-graphs are not independent.
Remark. Existing methods design the function $F(\cdot)$ to reconstruct the general graph $G$, so the learned representations only carry the information needed for accurate reconstruction. In contrast, we investigate how to learn a representation from both data sources, one that captures essential information from gene interactions while distinguishing different subtypes. Our investigation is based on the following observation: the onset of complex diseases is typically attributed to changes (e.g., perturbations or disruptions) within a limited subset of genes (Goh et al., 2007a).
Formally, given $X$ and a knowledge graph $G$, we have $\{G_1, G_2, \dots, G_{|Y|}\} = F(X; G)$. We aim to learn a unified representation $Z$ with two properties: (i) $Z$ is a high-quality encoding of the gene expression profiles $X$, that is, any $z^{(m)}$ and $x^{(m)}$ should correspond to the same patient group $y$; (ii) $Z$ can predict the gene-gene interactions in $E$. For the sub-graphs, we have two hypotheses:
• Hypothesis-1. The size of each sub-graph should satisfy $|G_y| \ll |G|$, in terms of both the node set $V$ and the link set $E$, while having large margin differences with other sub-graphs.
• Hypothesis-2. $G_y$ must maintain physical and biological meaningfulness. This is an important criterion evaluated in the experiments.

Section: GESUBNET


Section: FRAMEWORK
GeSubNet consists of three modules: patient sample representation learning module (Patient-M), graph representation learning module (Graph-M), and network inference module (Infer-M).
• Patient-M: This module sets up a cancer subtyping task, aiming to project patient gene expression profiles into a latent representation $Z_p$ that can distinguish subtypes. This is typically an unsupervised learning task (Withnell et al., 2021; Yang et al., 2021b;a; 2023). GeSubNet employs a Vector Quantized-Variational AutoEncoder (VQ-VAE) (Van Den Oord et al., 2017) for two purposes: (i) to model this discriminative latent space using a flexible categorical distribution (Chen et al., 2023a), and (ii) to use the decoder as a key component of Infer-M.
• Graph-M: This module forms a link prediction task, leveraging both knowledge databases and gene expression data to learn $Z_g$. The goal is to train a well-performing GNN autoencoder, where the encoder learns holistic gene interactions, and the decoder is used to generate new graphs. Since we focus on interactions, GeSubNet employs Neo-GNN (Yun et al., 2021), which combines structural information with node representations to prevent over-smoothing of node features.
• Infer-M: This module involves a novel objective function that integrates $Z_p$ and $Z_g$. GeSubNet uses the information from Patient-M, the decoder, and the reconstruction loss to optimize the prior knowledge in the gene network, i.e., the GNN encoder, for generating subtype-specific networks.

Section: SUBTYPE GENE NETWORK INFERENCE
Gene Expression Representation Learning - Patient-M. Given a gene expression dataset $X \in \mathbb{R}^{M \times N}$, we first encode the gene expression profile to a low-dimensional embedding $Z_e \in \mathbb{R}^{M \times D}$ through linear layers with a ReLU activation function: $Z_e = \mathrm{ReLU}(\mathrm{Linear}(X))$, where $D$ is the dimension of $Z_e$. We apply Batch Normalization to prevent overfitting to the limited patient gene expression samples. $Z_e$ is then projected along the $D$-axis into a set of feature vectors $Z_c \in \mathbb{R}^{M \times D \times S}$, where $S$ denotes the vector dimension. Next, we project $Z_c$ onto a discrete codebook (Van Den Oord et al., 2017; Chen et al., 2023b). This involves encoding each dimension of gene features into a code, resulting in $Z_p$. The codebook consists of $K$ latent vectors $P_{1:K}$, which define a $K$-way categorical distribution. The projection is conducted using nearest neighbor search. Then, a decoder, consisting of linear layers with ReLU activations, reconstructs the original gene expression profiles, $\hat{X} \in \mathbb{R}^{M \times N}$.
Gene Interactions Representation Learning - Graph-M. Given general graphs $G$ represented by an adjacency matrix $A$ and gene expression data $X$, we learn structural feature representations $X' \in \mathbb{R}^{v \times u}$ using two MLPs:
$$X' = \mathrm{MLP}_{node}\Big(\sum_{i=1}^{j} \mathrm{MLP}_{edge}(A_{ij}),\; X\Big),$$
where the first MLP handles edges and the second handles nodes. Next, we encode $X'$ with $A$ to obtain graph representations $Z_g \in \mathbb{R}^{N \times D}$. The GNN decoder computes similarity scores between paired node embeddings by first computing the element-wise product of $Z_g^{(i)}$ and $Z_g^{(j)}$. The resulting $D$-dimensional product is then aggregated into a single value as the similarity score. Finally, we train a binary classification MLP to perform the link prediction task:
$$\tilde{E}_{ij} := \begin{cases} 1, & \text{Similarity Score} \geq 0.5 \\ 0, & \text{otherwise} \end{cases}$$
where 1 indicates the presence of a link between nodes $v_i$ and $v_j$, and 0 indicates the absence of a link. We use the predicted result $\tilde{E}_{ij}$ to guide Graph-M in learning prior gene interaction knowledge.
Subtype Network Inference - Infer-M. This module integrates information from both Patient-M and Graph-M to optimize the prior cancer network and generate subtype-specific networks. We propose an objective function that uses Graph-M's graph generation capabilities, conditioned on the patient separation loss in Patient-M. GeSubNet follows three independent training phases.
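Two of the steps described above, the nearest-neighbor codebook projection in Patient-M and the product-based link scoring in Graph-M, can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `quantize` and `predict_links` are hypothetical, and a plain sum followed by a sigmoid stands in for the trained binary classification MLP that aggregates the element-wise product.

```python
import numpy as np

def quantize(z_c, codebook):
    """Map each S-dim feature vector to its nearest codebook entry.

    z_c:      (M, D, S) continuous feature vectors
    codebook: (K, S) latent vectors P_1..P_K
    Returns the code indices (M, D) and the quantized vectors (M, D, S).
    """
    # Squared Euclidean distance from every feature vector to every code: (M, D, K)
    diff = z_c[:, :, None, :] - codebook[None, None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    idx = np.argmin(dist, axis=-1)        # nearest neighbor search
    return idx, codebook[idx]

def predict_links(z_g, pairs, threshold=0.5):
    """Score candidate gene pairs from node embeddings z_g of shape (N, D).

    pairs: (E, 2) integer array of node index pairs (i, j).
    """
    i, j = pairs[:, 0], pairs[:, 1]
    prod = z_g[i] * z_g[j]                 # element-wise product, (E, D)
    logit = prod.sum(axis=1)               # ASSUMPTION: sum replaces the trained MLP
    score = 1.0 / (1.0 + np.exp(-logit))   # squash to (0, 1)
    return (score >= threshold).astype(int), score
```

With a two-entry codebook `[[0, 0], [1, 1]]`, the feature vectors `[0.1, 0.0]` and `[0.9, 1.2]` are assigned to codes 0 and 1, respectively. In a trained model the aggregation step would be learned rather than fixed.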
Recall that we first train a well-initialized Patient-M to learn $Z_p$ using gene expression profiles $X$. This captures distinct subtype information through the following loss function:
$$L(\phi; x) := -\mathbb{E}_{q_\phi(z_e|x)}\big[\log p_\phi(x|z_q)\big] \quad (1)$$
where $\phi$ represents the parameters of the encoder and decoder. Then, we implement Graph-M to map predefined gene interactions for a given cancer into $Z_g$:
$$L(\theta; \omega) := -\frac{1}{E}\sum_{e=1}^{E}\Big[h_e \log\big(\hat{h}_e(\omega; z_e(\theta))\big) + (1 - h_e)\log\big(1 - \hat{h}_e(\omega; z_e(\theta))\big)\Big] \quad (2)$$
where $\theta$ and $\omega$ are the parameters of the encoder and decoder in the GNNs, and $h_e$ represents the ground truth for the presence of a gene interaction. After training $L(\phi; x)$ and $L(\theta; \omega)$, we first fix the model parameters $\phi$ and $\omega$, and reconstruct a new gene expression profile via matrix multiplication:
$$\tilde{X} = Z_p \cdot Z_g^{T}.$$
The reconstruction error between the integrated $\tilde{X}$ and the original patient gene expression profile $X$ is used to optimize the parameters $\theta$ of the graph encoder by:
$$L(\theta; x) = -\mathbb{E}_{q_\theta(z_g|G)}\big[\log p_\phi(x|\tilde{x})\big] \quad (3)$$
Here, the graph encoder conditions the reconstruction of patient or subtype-specific gene expression profiles. This ensures that graph representations capture the subtle characteristics of each patient's gene expression profile, so that the newly inferred links/interactions are more relevant to subtypes.
[...] et al., 2010), glioblastoma multiforme (GBM) (Urbańska et al., 2014), brain lower grade glioma (LGG) (Forst et al., 2014), and ovarian serous cystadenocarcinoma (OV) (Jayson et al., 2014). Detailed information can be found in Table 1 and Appendix B.1.
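The Infer-M update described around Eq. (3) can be illustrated with a small NumPy sketch. It is a deliberate simplification and not the authors' training code: the patient representation is held fixed, the graph representation is updated directly (standing in for the encoder parameters θ), and a mean squared error between the integrated profile and the original profile stands in for the negative log-likelihood under the frozen decoder. The name `infer_step` and all toy sizes are hypothetical.

```python
import numpy as np

def infer_step(x, z_p, z_g, lr=0.1):
    """One gradient step on z_g for the loss mean((z_p @ z_g.T - x) ** 2)."""
    x_tilde = z_p @ z_g.T                          # integrated profile, (M, N)
    residual = x_tilde - x
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * residual.T @ z_p / residual.size  # d loss / d z_g, (N, D)
    return z_g - lr * grad, loss

rng = np.random.default_rng(0)
M, N, D = 8, 5, 3                 # toy sizes: patients, genes, latent dimension
x = rng.normal(size=(M, N))       # stand-in expression profiles
z_p = rng.normal(size=(M, D))     # frozen patient representation
z_g = rng.normal(size=(N, D))     # graph representation being refined

losses = []
for _ in range(50):
    z_g, loss = infer_step(x, z_p, z_g)
    losses.append(loss)
```

The reconstruction error shrinks over the iterations because only the graph side is adjusted to explain the patient profiles, which mirrors the idea that expression data refines the prior gene network.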

Section: EXPERIMENTS
- Preprocessing: TCGA collected cancer samples from various experimental platforms with different patient information, such as gene sequencing results, and the samples lacked alignment across platforms. First, we removed the unmatched gene IDs across cancer samples to ensure platform independence. Then, we identified and eliminated genes with zero expression (in more than 10% of samples) or missing values. Finally, we converted the scaled estimates in the original gene-level RSEM (RNA-Seq by Expectation Maximization) files to FPKM (fragments per kilobase of transcript per million mapped reads) data. The detailed data preprocessing pipeline can be found in Appendix B.2.
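The gene filtering step above can be sketched as follows. This is a minimal illustration, assuming the expression matrix is a NumPy array of shape (samples, genes) with NaN marking missing values; the function name `filter_genes` and the parameter `max_zero_frac` are hypothetical, and the actual pipeline is described in Appendix B.2.

```python
import numpy as np

def filter_genes(x, max_zero_frac=0.10):
    """Drop gene columns with missing values or too many zero entries.

    x: (samples, genes) expression matrix; NaN marks a missing value.
    A gene is dropped when more than `max_zero_frac` of its samples are zero.
    """
    has_missing = np.isnan(x).any(axis=0)
    zero_frac = (x == 0).mean(axis=0)
    keep = ~has_missing & (zero_frac <= max_zero_frac)
    return x[:, keep], keep
```

The returned boolean mask can be reused to subset gene identifiers so that the expression matrix and the gene list stay aligned.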
Gene network dataset. We collected gene functional networks corresponding to these four cancer types from four widely used knowledge databases: KEGG (KE) (Kanehisa & Goto, 2000), STRING (ST) (Szklarczyk et al., 2015), InterPro (Int) (Paysan-Lafosse et al., 2023), and Monarch (Mona) (Mungall et al., 2017).
- Preprocessing: We searched and downloaded raw network data through website APIs. We mapped the gene IDs in the expression dataset to the standard format of Entrez Gene IDs (Maglott et al., 2010) in the networks. We kept the gene interactions whose gene IDs are shared across both datasets. Finally, we reconstructed the raw data as a binary matrix to initialize the gene graph construction. More details of the datasets and preprocessing can be found in Appendix B.3 and B.4.
Baselines. We collected baselines from both the statistical methods and GNN-based methods.
(1) The statistical methods include WGCNA (Langfelder & Horvath, 2008), which identifies modules of highly correlated genes using Pearson correlation; wTO (Gysi et al., 2018), which normalizes correlation by all other correlations and calculates probabilities for each edge in the network; ARACNe (Margolin et al., 2006), which calculates mutual information between pairs of nodes and removes indirect relationships; and LEAP (Specht & Li, 2017), which utilizes pseudotime ordering to infer directional relationships.
(2) The GNN-based methods include GAERF (Wu et al., 2021a), which learns node features with a graph auto-encoder and a random forest classifier; LR-GNN (Kang et al., 2022), which generates node embeddings with a GCN encoder and applies the propagation rule to create links; and CSGNN (Zhao et al., 2021), which predicts node interactions using a mix-hop aggregator and a self-supervised GNN. More details are provided in Appendix C. 
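To make the idea behind the correlation-based statistical baselines concrete, the sketch below builds a binary co-expression network by thresholding absolute Pearson correlation. It is not WGCNA, wTO, or any other baseline as published (those add steps such as soft thresholding or probabilistic normalization); the function name `coexpression_network` and the threshold value are assumptions chosen for illustration.

```python
import numpy as np

def coexpression_network(x, threshold=0.8):
    """Binary gene-gene adjacency from absolute Pearson correlation.

    x: (samples, genes) expression matrix.
    Note: a constant gene column yields NaN correlations, which compare
    as False and therefore produce no links.
    """
    corr = np.corrcoef(x, rowvar=False)           # (genes, genes)
    adj = (np.abs(corr) >= threshold).astype(int)
    np.fill_diagonal(adj, 0)                      # no self-links
    return adj
```

Because such a network depends only on expression similarity, it carries no prior knowledge from databases, which is the main contrast with the GNN-based baselines.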
Method | BRCA CDV (↑) | BRCA GED (↑) | BRCA DCS (↓) | GBM CDV (↑) | GBM GED (↑) | GBM DCS (↓) | LGG CDV (↑) | LGG GED (↑) | LGG DCS (↓) | OV CDV (↑) | OV GED (↑) | OV DCS (↓)
WGCNA | 0.42 ± 0.02 | 0.39 ± 0.03 | 0.83 ± 0.04 | 0.43 ± 0.02 | 0.47 ± 0.03 | 0.83 ± 0.04 | 0.45 ± 0.03 | 0.53 ± 0.02 | 0.82 ± 0.04 | 0.24 ± 0.02 | 0.25 ± 0.03 | 0.83 ± 0.04
wTO | 0.44 ± 0.02 | 0.43 ± 0.02 | 0.79 ± 0.03 | 0.45 ± 0.02 | 0.47 ± 0.02 | 0.83 ± 0.04 | 0.43 ± 0.03 | 0.59 ± 0.03 | 0.76 ± 0.04 | 0.26 ± 0.02 | 0.25 ± 0.03 | 0.83 ± 0.04
ARACNe | 0.47 ± 0.02 | 0.45 ± 0.03 | 0.73 ± 0.03 | 0.44 ± 0.02 | 0.43 ± 0.02 | 0.79 ± 0.03 | 0.43 ± 0.03 | 0.57 ± 0.03 | 0.76 ± 0.04 | 0.23 ± 0.02 | 0.25 ± 0.03 | 0.81 ± 0.03
LEAP | 0.49 ± 0.03 | 0.44 ± 0.03 | 0.78 ± 0.03 | 0.48 ± 0.03 | 0.45 ± 0.03 | 0.78 ± 0.03 | 0.44 ± 0.03 | 0.55 ± 0.02 | 0.77 ± 0.04 | 0.22 ± 0.02 | 0.24 ± 0.03 | 0.84 ± 0.04
GAERF | 0.54 ± 0.06 | 0.58 ± 0.07 | 0.64 ± 0.05 | 0.46 ± 0.04 | 0.48 ± 0.06 | 0.76 ± 0.05 | 0.55 ± 0.05 | 0.56 ± 0.06 | 0.83 ± 0.07 | 0.34 ± 0.05 | 0.36 ± 0.06 | 0.82 ± 0.06
LR-GNN | 0.54 ± 0.05 | 0.59 ± 0.06 | 0.62 ± 0.04 | 0.57 ± 0.06 | 0.61 ± 0.07 | 0.75 ± 0.05 | 0.56 ± 0.06 | 0.66 ± 0.07 | 0.72 ± 0.05 | 0.34 ± 0.05 | 0.37 ± 0.06 | 0.82 ± 0.07
CSGNN | 0.65 ± 0.06 | 0.66 ± 0.07 | 0.52 ± 0.06 | 0.65 ± 0.07 | 0.64 ± 0.06 | 0.74 ± 0.05 | 0.58 ± 0.06 | 0.68 ± 0.07 | 0.73 ± 0.06 | 0.35 ± 0.05 | 0.35 ± 0.06 | 0.80 ± 0.05
GeSubNet | 0.75 ± 0.04 | 0.78 ± 0.04 | 0.47 ± 0.05 | 0.73 ± 0.04 | 0.74 ± 0.05 | 0.67 ± 0.04 | 0.67 ± 0.05 | 0.74 ± 0.04 | 0.62 ± 0.05 | 0.45 ± 0.04 | 0.44 ± 0.04 | 0.75 ± 0.04

Section: EXPERIMENT-I: NETWORK INFERENCE
Objective. This experiment evaluates the effectiveness of subtype-specific networks, following our Hypothesis-1: (1) $|G_y| \ll |G|$, ensuring the generated network is sparse compared to the original; (2) each subtype network exhibits structural differences from the others.
Setup and Metrics. We train GeSubNet for each cancer (the parameter settings can be found in Appendix D), and then evaluate the generated graphs for subtypes on two factors:
• Sparsity Assessment: we utilize the Coefficient of Degree Variation (CDV) (Pržulj, 2007) to measure the variability in node degree across the genes in a network. A higher CDV value indicates that most genes have very few interactions (edges) while only a few active genes dominate the interactions, i.e., the subtype network is sparser.
• Graph Structural Differences: we employ the Graph Edit Distance (GED) (Gao et al., 2010) and the DeltaCon Similarity (DCS) (Koutra et al., 2013) to measure structural differences in gene networks. GED captures local changes in gene interactions, while DCS evaluates global structural similarities. A high GED value indicates significant differences in gene interactions. Conversely, a high DCS implies high similarity.
Results. Table 2 shows that GeSubNet significantly outperforms all baseline methods in terms of GED, DCS, and CDV metrics across four cancer types. Compared with the second-best baseline, CSGNN, GeSubNet achieves improvements of 35.8%/32.4%/20.2%/34.1% in terms of GED across all four tasks. Additionally, it delivers a relative reduction of 29.8%/13.5%/21.6%/19.3% in terms of DCS. For CDV, the improvements are 33.4%/13.7%/17.9%/15.3%, respectively. In summary, when evaluating BRCA, GBM, LGG, and OV, GeSubNet consistently achieves lower DCS scores and higher GED and CDV scores. This indicates that the generated subtype-specific gene networks are sparse and structurally unique, i.e., they are significantly different from each other. The OV results are noticeably weaker, but this aligns with existing knowledge (Lawler et al., 2017) that OV is a challenging cancer type due to the limited available samples (only 291 patients in Table 1) and the lack of information on its pathogenic mechanisms in existing knowledge databases.
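As a rough illustration of the sparsity and local-difference metrics described above, the sketch below computes a degree-based coefficient of variation and an edge-level edit distance for undirected graphs stored as binary adjacency matrices. Both definitions are assumptions made for illustration: the coefficient is taken as the standard deviation of node degrees divided by their mean, and the edit distance counts edge insertions and deletions between graphs on the same labeled gene set with unit costs. Any normalization applied to obtain the values in Table 2 is not reproduced here, and DeltaCon is omitted.

```python
import numpy as np

def degree_cv(adj):
    """Coefficient of variation (std / mean) of node degrees.

    adj: symmetric binary adjacency matrix with at least one edge.
    """
    deg = adj.sum(axis=1)
    return float(deg.std() / deg.mean())

def edge_edit_distance(adj_a, adj_b):
    """Number of edge insertions/deletions between two undirected graphs
    defined on the same labeled node set."""
    diff = np.triu(adj_a != adj_b, k=1)   # count each undirected edge once
    return int(diff.sum())
```

For a four-node star graph the degrees are 3, 1, 1, 1 and the coefficient is about 0.577, whereas a four-node ring has identical degrees and a coefficient of 0, which matches the intuition that a few dominant hub genes raise the value.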

Section: EXPERIMENT-II: BIOLOGICAL MEANINGFULNESS
Objective. While the three graph metrics show the statistical significance of the generated networks, this experiment further evaluates their biological relevance, following our Hypothesis-2. (1) Beyond structural differences, we assess whether each network shows biologically functional differences from the other networks. (2) We examine whether the generated networks have the potential to narrow down key genes that contribute more specifically to their respective subtypes.
Setup and Metrics. We hence conduct two experiments as follows:
• Gene Ontology (GO) Analyses (Ashburner et al., 2000a): This method counts the number of unique GO terms (under the category Biological Process (Desmedt et al., 2008)) associated with the genes in each network. GO terms describe gene functions across biological processes, molecular functions, and cellular components, enabling comparisons between gene networks. For example, let GO($G_1$) := {A, B, C} and GO($G_2$) := {A, D, E}, where $G_1$ and $G_2$ represent two generated subtype gene networks and GO(·) denotes the set of GO terms of a network. Here the number of Enriched Biological Functions (#EBF) is 4, i.e., {B, C, D, E}, since A is the shared GO term. We evaluate GO for each cancer dataset across all baselines. A high #EBF value indicates greater functional diversity and biological differences between subtypes.
• Simulated Gene Knockout (Bergman & Siegal, 2003): This is a computational technique that mimics the effects of gene knockout experiments without physically altering the genome. In this simulation, a gene is either deleted or deactivated to study its role within a specific subtype by observing changes in the patient sample distribution. As described in the observation in Sec. 2, the key genes with significant expression differences form a small, limited set (Goh et al., 2007a), which leads to a distribution shift in patient samples during simulation experiments.
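The #EBF count in the GO analysis example above corresponds to a set symmetric difference for the two-network case. The short sketch below reproduces that example; the function name `enriched_biological_functions` is hypothetical, and how the count generalizes to more than two subtype networks is not specified in the text and is therefore not covered.

```python
def enriched_biological_functions(go_a, go_b):
    """GO terms found in exactly one of the two networks."""
    return set(go_a) ^ set(go_b)

go_g1 = {"A", "B", "C"}   # GO terms of subtype network G1
go_g2 = {"A", "D", "E"}   # GO terms of subtype network G2
ebf = enriched_biological_functions(go_g1, go_g2)
print(len(ebf), sorted(ebf))  # 4 ['B', 'C', 'D', 'E']
```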
Our experiments follow three steps: (1) Rank all genes based on node degree disparities between the generated networks. (2) Group the genes into two sets, a high-ranking gene set and a low-ranking gene set, based on a threshold. (3) Individually simulate the knockout for the high-ranking and low-ranking gene sets by transforming their expression values to a non-expression level.
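The three steps above can be sketched in NumPy as follows. This is a simplified illustration for two subtype networks on the same gene set, not the procedure from Appendix H: the disparity is assumed to be the absolute difference in node degree, the non-expression level is assumed to be 0.0, and the names `split_by_degree_disparity` and `knock_out` are hypothetical.

```python
import numpy as np

def split_by_degree_disparity(adj_a, adj_b, threshold):
    """Rank genes by |degree in A - degree in B| and split at a threshold.

    Returns (high_ranking, low_ranking) lists of gene indices,
    each ordered from largest to smallest disparity.
    """
    disparity = np.abs(adj_a.sum(axis=1) - adj_b.sum(axis=1))
    order = np.argsort(-disparity, kind="stable")   # largest disparity first
    high = [int(g) for g in order if disparity[g] >= threshold]
    low = [int(g) for g in order if disparity[g] < threshold]
    return high, low

def knock_out(x, genes, level=0.0):
    """Return a copy of x (samples, genes) with the given genes set to `level`."""
    x_ko = x.copy()
    x_ko[:, genes] = level
    return x_ko
```

Knocking out the two gene sets separately yields two perturbed expression matrices whose effect on the patient distribution can then be compared.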
To evaluate the results, this paper proposes a new metric, Shift Rate ($\Delta_{SR}$), to measure the likelihood of distributional shifts in a subtype after a set of genes is knocked out. It calculates the average distance between the sample distributions before and after the knockout. We set a threshold ($\sigma_t$) based on the sample spread to assess the significance of the distance. $\Delta_{SR}$ is defined by:
$$\Delta_{SR} = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\Big[\frac{1}{n}\sum_{i=1}^{n}\big\|x_i^{\text{before}} - x_i^{\text{after}}\big\| > k \cdot \sigma_t\Big] \quad (4)$$
where $T$ is the total number of knockout tests, $n$ is the number of patient samples within a subtype, $x_i$ represents an individual patient sample, $k$ is a scaling factor (e.g., 1.0 or 1.5) used to adjust the threshold, and $\sigma_t$ is the standard deviation of sample distances. Notably, this metric is only used after model training and is not involved in model training. More details on the simulated gene knockout can be found in Appendix H.
Results. Table 3 presents the results of the GO analysis, where our method consistently achieves the highest #EBF value across all datasets. These higher values indicate that the generated networks not only exhibit structural differences but also show functional distinctions from one another from a biological perspective (Ashburner et al., 2000b; Wu et al., 2021b). Figure 3 presents Venn diagrams of the detailed GO analysis for the four cancer datasets, highlighting overlapping and unique biological functions among three selected baselines and our method. We can consistently identify several unique functions across all datasets, while other methods rarely uncover unique functions, even when they achieve a comparable #EBF. For instance, in the LGG cancer dataset, CSGNN identifies 4 #EBF but finds no unique functions, whereas our method identifies 6 #EBF with 3 unique functions. From a biological perspective, our method demonstrates a robust array of enriched GO terms across different cancers, including pathways like Apoptotic signaling, Wnt signaling, Tumor necrosis factor signaling, and Cell proliferation. These terms represent critical cancer-related biological functions common to many cancers (Aktipis & Nesse, 2013), as shown in Table 9 in Appendix G. For unique functions, our method identifies the "Immune diseases" function in BRCA, which has evident support as being related to breast cancer (McAlpine et al., 2012), and the "DNA damage checkpoint signaling" pathway, which is specific to GBM (Cheng et al., 2011).
[Figure 4(a): patient groups before and after simulating gene knockout. High-ranking genes: shift to other patient groups. Low-ranking genes: no obvious shift.]
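The Shift Rate in Eq. (4) can be illustrated with a minimal NumPy sketch. It assumes one reading of the metric: each knockout test counts as a shift when the mean per-sample Euclidean distance between the before and after samples exceeds k times the spread estimate for that test, and the metric is the fraction of such tests. The function name `shift_rate` is hypothetical, the spread values are taken as given inputs, and the space in which distances are measured (raw profiles or learned embeddings) is left to the caller.

```python
import numpy as np

def shift_rate(before, after, sigmas, k=1.0):
    """Fraction of knockout tests whose mean per-sample displacement
    exceeds k * sigma_t.

    before, after: lists of (n, d) arrays, one pair per knockout test
    sigmas:        one spread estimate sigma_t per test
    """
    shifted = []
    for x_b, x_a, sigma in zip(before, after, sigmas):
        mean_dist = np.linalg.norm(x_b - x_a, axis=1).mean()
        shifted.append(mean_dist > k * sigma)
    return float(np.mean(shifted))
```

Under this reading, a value of 0.83 means that 83% of the knockout tests moved the patient samples by more than the chosen threshold.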
Figure 4 shows the results of the simulated gene knockout experiments. Subfigure (a) visualizes an example of the patient distribution before (red points) and after (green points) the simulated gene knockout in both target and control groups (subtypes). In the left subfigure, the before and after distributions are almost identical for the low-ranking gene set. In contrast, the right subfigure shows a significant shift in patient distribution, indicating that suppressing high-ranking genes has a greater impact.
Figure 4(b) provides a quantitative comparison across 11,327 genes in BRCA. Our method achieves the highest ∆SR for high-ranking genes, with an 83% likelihood of significantly shifting sample distributions. Meanwhile, the 12% ∆SR for low-ranking genes suggests that our method effectively filters out common genes. Other baselines exhibit substantially lower ∆SR values for high-ranking genes, ranging from 20% to 30%, nearly matching those for low-ranking genes. Notably, while GNN-based methods like LR-GNN and CSGNN achieve comparable results in graph statistical metrics, their biological relevance is lower. This discrepancy arises because their objective functions aim only to reconstruct general disease networks, including irrelevant gene interactions, for all subtype samples. Although gene expression data embeddings result in different graph structures, these methods do not explicitly learn the distinct gene interactions unique to disease subtypes. In contrast, learning a representation that incorporates prior knowledge while explicitly distinguishing patient samples in different subtypes is the key focus of this paper. This simulation experiment further validates the effectiveness of our method and demonstrates that GeSubNet maintains biological significance.

Section: CASE STUDY
The case study on BRCA cancer follows established protocols in bioinformatics gene function studies (Huang et al., 2009). The analysis workflow is available in Appendix I.1 and I.2.
Figure 5 shows the gene networks A and B obtained for two BRCA subtypes. We observe that our method generates gene networks with more distinct gene nodes. The networks show significant differences between the two subtypes, whereas the baselines produce more similar networks. Figure 6(1) shows the gene expression distribution for the high-ranking and low-ranking gene sets. Different patient groups are marked in different colors to represent the ground truth. In the first column, we observe minimal differences in the expression distribution of low-ranking genes across patient groups. However, significant differences are evident in the high-ranking gene sets, as shown by the noticeable shift in distribution peaks. Figure 6(2) presents expression heatmaps for the top three genes in both the high- and low-ranking gene sets. For the high-ranking set, the genes are ERBB2, CCNA2, and CCNE1, while the low-ranking set includes HHIP, MAPK1, and STK4. The high-ranking genes exhibit large differences in expression across subtypes, reflected by distinct color variations corresponding to the labels.

Section: CONCLUSIONS
This paper introduced GeSubNet, a framework for inferring disease subtype-specific gene networks. GeSubNet includes sample and gene embedding learning modules that capture the characteristics of both patients and the prior gene graph. These embeddings are then utilized to reconstruct the input gene profile in the network inference module. This approach incorporates patient group information into the updated gene embeddings, enabling more accurate gene network inference specific to patient groups. In general, we explored a method for group-specific gene network inference on real-world clinical data. Importantly, we demonstrated the reliability of GeSubNet through a series of biological validations. We believe that continued investigation will bridge computational and biological sciences, advancing the understanding of diseases and foundational gene roles.


Section: B DATASET
B.1 GENE EXPRESSION DATA
Gene expression refers to the process by which information from a gene is used to synthesize functional gene products, typically proteins. This process is tightly regulated and varies between cell types, tissues, and environmental conditions, such as the tumor microenvironment (Brazma & Vilo, 2000). By measuring gene expression levels, researchers can determine the activity of specific genes within a cell or tissue at any given moment.
Gene expression data has long been used in cancer research (Zhang et al., 1997) because cancer is driven by the dysregulation of cellular processes, which often manifests as abnormal gene expression patterns. High-throughput technologies, such as RNA sequencing (RNA-Seq) and microarrays, capture patient gene expression profiles and enable simultaneous large-scale measurement of expression across thousands of genes (Liang & Pardee, 2003). Gene expression data allows researchers to study the molecular mechanisms underlying tumor development and progression.
The gene expression data used in this study were collected from The Cancer Genome Atlas (TCGA) (The Cancer Genome Atlas Research Network, 2013) and obtained through the Genomic Data Commons (GDC) portal (Grossman et al., 2016), the world's largest cancer gene information database. All candidate patient samples were generated across various experimental platforms from cancer samples before treatment. In the cancer research community, available data are commonly contributed by various cancer study projects and institutions. As a result, the data are typically generated on different assay platforms. This non-uniformity of assay platforms introduces technical variations, such as differences in experimental protocols. These inherent batch effects pose a challenge because they can significantly impact downstream model training and any further analysis.

Section: B.2 PREPROCESSING OF GENE EXPRESSION DATA
To ensure platform independence, we first removed genes that were missing on one or more platforms. For the gene expression (transcriptomics) data generated from the Hi-Seq platform, we converted the scaled estimates in the original gene-level RSEM (RNA-Seq by Expectation-Maximization) files to FPKM (fragments per kilobase of transcript per million mapped reads) data. For the remaining data, generated from the Illumina GA and Agilent array platforms, we first identified and removed all non-human expression features. Subsequently, we applied a logarithmic transformation to the converted data. To eliminate potential noise, we removed features with zero expression levels (in more than 10% of samples) or missing values (designated as N/A). Table 5 describes the details of all experimental cancer gene expression datasets.
Preprocessing pipeline in R (Ver. 4.2.1):
(1) Data Import: Gene expression data were loaded after download.
data <- read.csv("gene_expression_data.csv")
(2) Filtering Low-Quality Samples: Samples with a low number of expressed genes were removed using a default cutoff based on counts per million (CPM) values calculated with the edgeR package (Robinson et al., 2010).
keep <- rowSums(cpm(data) > 1) >= 10
filtered_data <- data[keep, ]
(3) Normalization: To account for differences in sequencing depth across samples, normalization was performed using the TMM (Trimmed Mean of M-values) method from the edgeR package.
(5) Log Transformation: The gene expression data were log-transformed to stabilize variance across genes.
log_data <- log2(normalized_data + 1)
(6) Missing Data Imputation: Missing expression values were imputed using the impute.knn function from the impute package (T et al., 2022).
imputed_data <- impute.knn(log_data)$data
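The noise-filtering and log-transformation rules described in this section can also be illustrated outside of R. The following is a minimal Python sketch, not the pipeline used in this study: the function name `filter_and_log`, the row-per-gene layout, the use of `None` for N/A, and the toy values are all assumptions made for illustration. It drops features with any missing value or with zero expression in more than 10% of samples, and applies log2(x + 1) to the rest.

```python
import math

def filter_and_log(matrix, gene_names, max_zero_frac=0.10):
    """Filter features and log-transform the survivors.

    matrix: one row per gene, one column per sample; None marks a missing value.
    A gene is dropped if it has any missing value or if its fraction of
    zero-expression samples exceeds max_zero_frac (10% in the text).
    """
    kept_names, kept_rows = [], []
    for name, row in zip(gene_names, matrix):
        if any(v is None for v in row):
            continue  # missing values (N/A)
        zero_frac = sum(1 for v in row if v == 0) / len(row)
        if zero_frac > max_zero_frac:
            continue  # too many zero-expression samples
        kept_names.append(name)
        kept_rows.append([math.log2(v + 1) for v in row])
    return kept_names, kept_rows

# Toy data (hypothetical values, not taken from TCGA).
genes = ["G1", "G2", "G3"]
expr = [
    [3.0, 1.0, 7.0, 15.0],   # kept
    [0.0, 0.0, 5.0, 2.0],    # 50% zeros, dropped
    [4.0, None, 1.0, 2.0],   # missing value, dropped
]
names, logged = filter_and_log(expr, genes)
print(names, logged)
```

Because log2(0 + 1) = 0, applying the zero filter before or after the log transformation gives the same set of retained features.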

Section: B.3 GENE NETWORK DATA
To obtain refined and coherent prior gene networks, we curated a comprehensive dataset by combining information from diverse sources, including KEGG (Kanehisa & Goto, 2000), STRING (Szklarczyk et al., 2015), InterPro (Paysan-Lafosse et al., 2023), and Monarch (Mungall et al., 2017). These repositories collectively provide information on a broad spectrum of gene interactions corroborated by evidence from high-throughput lab experiments, co-expression analyses, genomic context predictions, disease-related gene pathways, and previously published knowledge.

Section: B.4 PREPROCESSING OF GENE NETWORK DATA
Our detailed preprocessing is as follows. We began the network construction process by retrieving related gene information through database APIs for a specified target cancer entry available in the databases above. To ensure uniformity in gene identifiers across disparate datasets, we harmonized gene IDs to the standard format of Entrez Gene IDs (Maglott et al., 2010). Subsequently, we identified and included genes common to all database sources as candidate nodes for constructing the prior network. Next, for each candidate node pair, we retained the gene-gene associations found in multiple databases as the final edges. Isolated nodes were then removed from the network. During this curation of edges, we implemented two distinct screening strategies to obtain two types of networks whose edges embody distinct correlation properties: (1) we identified edges denoting that the proteins are integral components of a physical complex, denoted as edges of Type I; and (2) we retained edges indicative of functional and physical protein associations, denoted as edges of Type II. This approach enhances our prior gene network by capturing diverse aspects of gene relationships and interactions.
… et al., 2006). These methods have continually contributed to bio-network studies for many years.
In addition, we wanted to include more recently proposed methods in this category. Our selection principles were as follows: (1) the method was published recently (approximately a decade after the above classical methods), (2) it provides open-source code (available on platforms such as GitHub), (3) it is easy to use (with outputs compatible with popular analysis packages), and (4) it has been used as a baseline in related bio-network research. Based on these criteria, we selected two methods, wTO (Gysi et al., 2018) and LEAP (Specht & Li, 2017).
For the GNN-based methods, we applied the same principles used above for selecting newly proposed statistical methods. We selected three methods (i.e., GAERF (Wu et al., 2021a), LR-GNN (Kang et al., 2022), and CSGNN (Zhao et al., 2021)) that meet these principles.

Section: C.2 BASELINE METHODS
(1) Weighted Gene Co-expression Network Analysis (WGCNA) (Langfelder & Horvath, 2008) utilizes Pearson correlation to identify modules of highly correlated genes, where genes within the same module are likely to be functionally related or involved in similar biological processes.
(2) Weighted Topological Overlap (wTO) (Gysi et al., 2018) normalizes the chosen correlation by all other correlations and calculates a probability for each edge in the network.
(3) Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) (Margolin et al., 2006) calculates the mutual information between pairs of nodes and then removes indirect relationships during network building.
(4) Lag-based Expression Association for Pseudotime-series (LEAP) (Specht & Li, 2017) utilizes pseudotime ordering to infer the directionality between genes in the network.
(5) Graph Auto-encoder and Random Forest (GAERF) (Wu et al., 2021a) learns node features with a graph auto-encoder and concatenates the features of two nodes as input for a random forest classifier.
(6) Link Representation-Graph Neural Network (LR-GNN) (Kang et al., 2022) generates embeddings using a GCN encoder and then applies a propagation rule to create link representations for predicting associations in networks.
(7) Contrastive Self-supervised Graph Neural Network (CSGNN) (Zhao et al., 2021) predicts node interactions in networks by employing a mix-hop aggregator and a contrastive self-supervised GNN.
WGCNA, wTO, ARACNe, and LEAP are widely used traditional methods that take only non-graph gene expression data as input, while GAERF, LR-GNN, and CSGNN are deep learning-based methods that take known pairwise interactions or networks as input. These methods are reported to have competitive performance on similar tasks such as molecular interaction prediction. It is also worth noting that these methods tend to perform better when supplementary data, such as sequence data, is available.

Section: D HYPERPARAMETER SETTING
We conducted parameter sensitivity experiments to determine the optimal hyperparameters. The results are presented in Table 7. Overall, the findings indicate that the model's performance is not significantly affected by changes in the hyperparameters.

Section: E COMPUTATIONAL REQUIREMENTS
We evaluated the computational requirements of the proposed method, including runtime (training and inference time) and memory usage, across four cancer datasets. While cancer patient gene expression datasets typically contain hundreds of patients for each cancer type, we assessed the method's scalability by conducting an additional test using the TCGA Pan-Cancer expression dataset based on the BRCA graph. This dataset includes 8,314 patient samples and is one of the largest pan-cancer datasets. The results are summarized in Table 8. We find that the computational requirements of the proposed method are manageable for real-world cancer datasets and scale effectively to larger datasets with more patient samples.


Section: F EVALUATION METRICS
(1) Graph Edit Distance (GED). GED (Gao et al., 2010) measures dissimilarity between graphs by quantifying the minimum cost required to transform one graph into another through a series of edit operations, such as adding or deleting nodes and edges and modifying node or edge attributes. GED between two gene networks $N_1$ and $N_2$ is defined as:
$$\mathrm{GED}(N_1, N_2) = \min_{\pi} \sum_{(u,v) \in \pi} c(u, v).$$
Here, $\pi$ is a set of edit operations, typically represented as a set of pairs $(u, v)$ where $u$ is a node in $N_1$ and $v$ is a node in $N_2$. This set $\pi$ represents the optimal alignment or correspondence between nodes in the two networks. $c(u, v)$ is the cost associated with aligning nodes $u$ and $v$, depending on factors such as node attributes, edge attributes, or the type of edit operation. We calculate the overall GED among $n$ inferred networks as:
$$\mathrm{GED}(N_1, N_2, \ldots, N_n) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \mathrm{GED}(N_i, N_j).$$
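To make the averaging over network pairs concrete, the following Python sketch computes the mean pairwise distance for a small set of toy networks. It is a simplified illustration under a strong assumption that is not stated in the text: nodes carry fixed gene labels, so the node alignment is known and, with unit edit costs, the distance reduces to the number of nodes and edges present in one network but not the other. General GED requires searching over alignments $\pi$ and is not implemented here. The helper names `make_net`, `labeled_ged`, and `mean_pairwise_ged` and the gene labels are hypothetical.

```python
from itertools import permutations

def make_net(edge_list):
    """Build (nodes, edges) from a list of undirected gene pairs."""
    edges = {frozenset(e) for e in edge_list}
    nodes = {gene for e in edge_list for gene in e}
    return nodes, edges

def labeled_ged(net_a, net_b):
    """Unit-cost edit distance when nodes are matched by their labels."""
    nodes_a, edges_a = net_a
    nodes_b, edges_b = net_b
    # Symmetric differences count the node and edge insertions/deletions.
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)

def mean_pairwise_ged(networks):
    """Average of GED(N_i, N_j) over all ordered pairs with i != j."""
    n = len(networks)
    total = sum(labeled_ged(networks[i], networks[j])
                for i, j in permutations(range(n), 2))
    return total / (n * (n - 1))

# Three toy subtype networks over placeholder gene labels.
net1 = make_net([("A", "B"), ("B", "C")])
net2 = make_net([("A", "B"), ("B", "D")])
net3 = make_net([("A", "B")])
overall = mean_pairwise_ged([net1, net2, net3])
print(labeled_ged(net1, net2), overall)
```

A larger value indicates that the inferred subtype networks differ more from one another in structure.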
(2) DeltaCon Similarity (DCS). This is a similarity score calculated with the DeltaCon algorithm (Koutra et al., 2013). DCS quantifies the structural similarity between two graphs by comparing the influence of nodes across these graphs. It relies on an influence matrix derived from the graph Laplacian, and the similarity reflects how node influences differ between the two graphs. DCS is mathematically defined as:
$$\mathrm{DCS}(G_1, G_2) = 1 - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{1}{N} \sum_{i=1}^{N} \left( I_{G_1}(i, j) - I_{G_2}(i, j) \right)^2,$$
where $N$ is the number of nodes in the graphs, and $I_{G_1}(i, j)$ and $I_{G_2}(i, j)$ represent the influence of node $i$ on node $j$ in graphs $G_1$ and $G_2$, respectively.
(3) Coefficient of Degree Variation (CDV). The degree distribution of a gene network is the frequency distribution of node degrees, where the degree of a node is the number of interactions of that gene. Variation in connectivity implies that certain genes are more central or connected than others, and these central genes may have crucial roles in defining or influencing specific cancer subtypes. CDV (Pržulj, 2007) also decreases as the average degree ($\bar{k}$) of the network increases. CDV is defined as:
$$\mathrm{CDV} = \frac{\sqrt{\frac{1}{N} \sum_{i=1}^{N} (k_i - \bar{k})^2}}{\bar{k} \sqrt{N}} \times \frac{1}{\bar{k}}.$$
Here, $N$ is the total number of nodes, $k_i$ is the degree of node $i$, and $\bar{k}$ is the average degree.
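As a small numerical illustration, the sketch below computes CDV from a list of node degrees. It assumes the definition reads as the population standard deviation of the degrees, divided by $\bar{k}\sqrt{N}$ and multiplied by $1/\bar{k}$; this reading, the function name `cdv`, and the toy degree lists are assumptions for illustration only.

```python
import math

def cdv(degrees):
    """Coefficient of degree variation under the reading stated above.

    degrees: list of node degrees k_i for the N nodes of one network.
    """
    n = len(degrees)
    k_bar = sum(degrees) / n
    # Population standard deviation of the degrees.
    std = math.sqrt(sum((k - k_bar) ** 2 for k in degrees) / n)
    return std / (k_bar * math.sqrt(n)) * (1 / k_bar)

uniform = cdv([2, 2, 2, 2])   # every gene equally connected
hub = cdv([3, 1, 1, 1])       # one hub gene, same number of nodes
mild = cdv([2, 2, 1, 1])      # weaker variation, same average degree as hub
print(uniform, hub, mild)
```

A network in which every gene has the same degree scores zero, and the score grows as a few hub genes come to dominate the connectivity.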
(4) Number of enriched biological functions (#EBF). Corresponding to the differences in graph properties of gene networks, we also explore their biological significance. A commonly used method for this is functional enrichment analysis, which identifies biological functions, pathways, and molecular activities that are overrepresented within a gene set when compared to a random selection of genes with a similar size and degree distribution from the genome. In our study, we performed Gene Ontology (GO) enrichment analysis using the R package clusterProfiler (Ashburner et al., 2000a), which leverages data from databases such as KEGG and GO to identify enriched biological terms. A greater degree of enrichment suggests that the network exhibits more meaningful gene interactions than would be expected by chance. This unique enrichment across subtypes implies that the gene networks represent biologically significant interactions, where genes within specific cancer subtype networks are functionally connected as a group. To evaluate the functional diversity between two gene networks, we conducted an experiment using GO to count the number of unique GO terms associated with the genes in each network. Specifically, we used the enrichGO() function from clusterProfiler to map the genes from both networks to their corresponding GO terms. The compareCluster() function was applied to compare the sets of GO terms associated with each network and to identify differences, focusing on the number of enriched biological functions. To quantify the differences, we calculated the number of enriched biological functions (#EBF) using the symmetric difference between the sets of GO terms. Mathematically, this is represented as:
$$\#\mathrm{EBF} = \left| \left( \mathrm{GO}(G_1) \setminus \mathrm{GO}(G_2) \right) \cup \left( \mathrm{GO}(G_2) \setminus \mathrm{GO}(G_1) \right) \right|.$$
This operation counts the functions present in one network but not the other. Enrichment was evaluated based on statistical significance, and biological functions with a p-value < 0.05 were reported. A higher #EBF indicates that the networks capture different biological processes or molecular functions, potentially reflecting the underlying biological differences between the networks' contexts.
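The counting step can be sketched in a few lines of Python once the enrichment results are available. The example below is illustrative only: it assumes each network's enrichment output has already been reduced to a mapping from GO term to p-value, and the function name `count_ebf` and the p-values are made up. It does not reproduce the clusterProfiler workflow itself.

```python
def count_ebf(go_terms_a, go_terms_b, p_cutoff=0.05):
    """Count GO terms enriched in exactly one of two networks.

    go_terms_a, go_terms_b: dicts mapping GO term -> enrichment p-value.
    Only terms with p < p_cutoff are treated as enriched.
    """
    sig_a = {term for term, p in go_terms_a.items() if p < p_cutoff}
    sig_b = {term for term, p in go_terms_b.items() if p < p_cutoff}
    unique = (sig_a - sig_b) | (sig_b - sig_a)  # symmetric difference
    return len(unique), unique

# Hypothetical enrichment results for two subtype networks.
network_1 = {
    "apoptotic signaling pathway": 0.001,
    "Wnt signaling pathway": 0.020,
    "cell proliferation": 0.200,   # not significant, ignored
}
network_2 = {
    "apoptotic signaling pathway": 0.004,
    "DNA damage checkpoint signaling": 0.010,
}
n_ebf, unique_terms = count_ebf(network_1, network_2)
print(n_ebf, sorted(unique_terms))
```

Terms shared by both networks, such as the apoptotic signaling pathway in this toy input, do not contribute to the count.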

Section: G GO FUNCTION ENRICHMENT ANALYSIS
G.1 GO (GENE ONTOLOGY) AND BIOLOGICAL PROCESS
Gene Ontology (GO) (Ashburner et al., 2000a) is a comprehensive and standardized framework for annotating genes based on their functions in biological contexts. It provides a structured vocabulary to describe the roles of genes in three broad categories:
• Biological Process (BP): This category defines the biological objectives or events in which the gene products are involved. BP terms focus on cellular functions that go awry in diseases like cancer. The category includes essential processes such as "signal transduction", "cell cycle regulation", "immune response", and "metabolic processes".
• Cellular Component (CC): This category defines the cellular locations where the gene products carry out their functions. CC terms focus on the spatial aspect of gene activity. Examples include terms like "cytosol", "nucleus", "mitochondrion", and "plasma membrane".
• Molecular Function (MF): This category outlines the biochemical activities of the gene product. MF terms focus on how genes contribute to cellular machinery on a molecular level. Examples include terms like "ATP binding", "enzyme activity", and "protein binding".
The Biological Process (BP) category is of particular relevance to cancer research because it directly reflects the underlying mechanisms that drive cancer development and progression and, therefore, serves as a suitable representative class for cancer-related GO terms (Desmedt et al., 2008).
Identifying disruptions in BP terms related to cancer mechanisms can also help guide therapeutic strategies. For instance, drugs targeting cell cycle checkpoints, such as cyclin-dependent kinase inhibitors, or those promoting apoptosis, like Bcl-2 family inhibitors, are being developed to specifically correct or mitigate BP-related dysfunctions (Shapiro, 2006;Ashkenazi et al., 2017).

Section: G.2 GO FUNCTION ENRICHMENT ANALYSIS RESULTS
Table 9 presents the enriched Gene Ontology (GO) terms associated with various biological functions across four cancer types (BRCA, GBM, LGG, and OV), as identified using different methods in the GO analysis of Experiment II. The Venn diagrams in Figure 3 illustrate the overlaps among the results from the different methods. Due to the complexity of comparing multiple methods, we present a four-way Venn diagram focusing on four selected methods (WGCNA, CSGNN, LR-GNN, and GeSubNet) for clarity.
GeSubNet findings: GeSubNet shows a robust array of enriched GO terms across different cancers, including:
• Apoptotic signaling pathway: A series of biochemical events leading to programmed cell death, which is essential for eliminating damaged or unwanted cells and maintaining tissue homeostasis. Dysregulation of apoptosis is a hallmark of cancer.
• Wnt signaling pathway: A network of proteins involved in cell signaling that regulates important processes such as cell proliferation, migration, and differentiation. Aberrant Wnt signaling is often implicated in cancer development.
• Tumor necrosis factor signaling: A signaling pathway that can induce inflammation, apoptosis, or cell survival, depending on the context. It is involved in various aspects of cancer biology, including tumorigenesis and immune response.
• Cell proliferation: The process by which cells divide and multiply, essential for growth and tissue repair. In cancer, deregulated cell proliferation leads to tumor growth and cancer progression.
This set of terms encompasses a range of crucial cancer-related biological functions shared by most cancers. This indicates that the resulting gene network remains physically and biologically meaningful, i.e., its backbone consists of genes involved in core cancer progression.

Comparison: The proposed method identifies a broader range of distinct GO terms compared to other methods, and the GO term set identified by GeSubNet constitutes a superset of the terms identified by the other methods.
For instance, in BRCA, WGCNA and CSGNN identify terms primarily focusing on cell cycle regulation, DNA repair, and apoptosis. wTO and ARACNe report similar functionalities with notable overlaps. GAERF and LR-GNN overlap more with the proposed method but still do not capture as many terms as the proposed method. The proposed method's overlap with other approaches is significant, particularly regarding core cancer pathways such as DNA repair (present in all methods), Cell cycle arrest (common in most methods), and Apoptotic signaling pathways (reported by several methods). However, the proposed method finds unique terms, such as immune diseases in BRCA, 

Section: H SIMULATED GENE KNOCKOUT EXPERIMENT
H.1 WORKFLOW
Step 1: The simulation begins by ranking all genes based on node degree disparities calculated from the connectivity matrices of the sub-networks. Node degree is quantified as the number of direct connections each gene has to other genes within the network, serving as a measure of its centrality and influence across different cancer subtypes. To derive the connectivity matrices, we analyze the interactions between genes, where each gene is represented as a node and each interaction as an edge. The degree of each node is then computed to identify highly interconnected genes.
Step 2: After ranking, we categorize the genes into two sets: a high-ranking gene set, which includes genes exhibiting the largest degree disparities (above a defined threshold based on node degree variance), and a low-ranking gene set, composed of genes with minimal degree differences (below the same threshold). Using node degree variance as a threshold ensures our classification is statistically grounded. This method isolates genes that play critical roles in the network dynamics.
Step 3: Next, we individually simulate the knockout of genes within the high-ranking and low-ranking gene sets. This process involves transforming their expression values to a baseline non-expression level, which is defined as either zero or a predefined low expression value (such as the mean expression level of the lowest 10% of genes). This transformation mimics the functional loss of these genes. For each gene target in the selected sets, we systematically replace its expression value in the patient samples with the baseline non-expression level.
To ensure robustness and statistical validity, we repeat the simulations multiple times, typically running each simulation for a predetermined number of iterations (e.g., 100 or 1000). Each simulation involves the random selection of a subset of genes from the respective gene set. For the random selection, we define the number of genes to be included in each subset based on a fraction of the total genes in the gene set. For instance, we set p(select) to 10%, which means we select 10% of the genes from the high-ranking gene set and 10% from the low-ranking gene set in each iteration. This approach allows us to assess the impact of knocking out varying combinations of genes while maintaining a consistent sample size across runs. The random selection is performed using a uniform sampling technique to ensure that each gene has an equal chance of being included in the knockout simulation for that run. After each knockout simulation, we record the changes in patient distributions regarding the Shift Rate (SR).
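The three steps above can be sketched as follows. This is a minimal Python illustration rather than the implementation used in the experiments: the adjacency-dictionary representation, the function names, the placeholder gene labels `g1` to `g4`, and the caller-supplied `threshold` are assumptions (the text bases the threshold on node degree variance without giving the exact rule). One call to `simulate_knockout` corresponds to a single iteration; the repeated runs described above would call it in a loop.

```python
import random

def rank_by_degree_disparity(adj_a, adj_b):
    """Step 1: rank genes by |degree in network A - degree in network B|.

    adj_a, adj_b: dicts mapping gene -> set of neighbouring genes.
    """
    genes = set(adj_a) | set(adj_b)
    disparity = {g: abs(len(adj_a.get(g, ())) - len(adj_b.get(g, ())))
                 for g in genes}
    ranked = sorted(genes, key=lambda g: (-disparity[g], g))
    return ranked, disparity

def split_gene_sets(disparity, threshold):
    """Step 2: genes above the threshold are high-ranking, the rest low-ranking."""
    high = {g for g, d in disparity.items() if d > threshold}
    return high, set(disparity) - high

def simulate_knockout(samples, gene_set, p_select=0.10, baseline=0.0, rng=None):
    """Step 3: set a uniformly sampled subset of genes to the baseline level.

    samples: list of dicts mapping gene -> expression value.
    """
    rng = rng or random.Random()
    size = max(1, round(p_select * len(gene_set)))
    knocked = set(rng.sample(sorted(gene_set), size))
    after = [{g: (baseline if g in knocked else v) for g, v in s.items()}
             for s in samples]
    return knocked, after

# Toy subtype networks and patient samples (hypothetical values).
adj_a = {"g1": {"g2", "g3", "g4"}, "g2": {"g1"}, "g3": {"g1"}, "g4": {"g1"}}
adj_b = {"g1": set(), "g2": {"g4"}, "g3": set(), "g4": {"g2"}}
ranked, disparity = rank_by_degree_disparity(adj_a, adj_b)
high, low = split_gene_sets(disparity, threshold=2)
patients = [{"g1": 5.0, "g2": 1.0, "g3": 2.0, "g4": 0.5}]
knocked, after = simulate_knockout(patients, high, rng=random.Random(0))
print(ranked, high, knocked, after)
```

Passing a seeded `random.Random` instance keeps individual runs reproducible while still sampling each gene with equal probability.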

Section: H.2 SHIFT RATE
Shift Rate: The shift rate measures the likelihood that sample groups shift significantly after a set of genes is knocked out. It takes the average distance between the samples of a patient group before and after the knockout and compares this distance to an adaptive threshold based on the spread (standard deviation) of the samples. Let $x_i^{\text{before}}$ and $x_i^{\text{after}}$ denote a sample within a given group before and after gene knockout, so that the distance between them is $\| x_i^{\text{before}} - x_i^{\text{after}} \|$. The spread of samples within the group before knockout in knockout test $j$ is quantified by the standard deviation $\sigma_j$ of their distances to the centroid of the before group. The shift rate (SR) is defined as:
$$\Delta SR = \frac{1}{m} \sum_{j=1}^{m} \mathbb{1}\left[ \frac{1}{n} \sum_{i=1}^{n} \left\| x_i^{\text{before}} - x_i^{\text{after}} \right\| > k \cdot \sigma_j \right],$$
where $\mathbb{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise, $m$ is the total number of knockout tests, $n$ is the number of samples within the group, $\sigma_j$ is the standard deviation of the distances between the samples before knockout and the centroid of the group in knockout test $j$, and $k$ is a scaling factor (e.g., 1.0 or 1.5) used to determine the threshold for considering a shift.
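The shift rate can be illustrated with a short Python sketch. It follows one reading of the definition, namely that a knockout test counts as a shift when the mean before-to-after displacement of the group exceeds $k \cdot \sigma_j$. The Euclidean distance, the population standard deviation, the function names, and the two-dimensional toy coordinates are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def spread(points):
    """Standard deviation of the distances from each sample to the centroid."""
    c = centroid(points)
    dists = [euclidean(p, c) for p in points]
    mean = sum(dists) / len(dists)
    return math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))

def shift_rate(tests, k=1.0):
    """Fraction of knockout tests whose mean displacement exceeds k * sigma_j.

    tests: list of (before, after) pairs; each is a list of sample vectors
    in the same order, so before[i] and after[i] are the same patient.
    """
    shifted = 0
    for before, after in tests:
        mean_move = sum(euclidean(b, a) for b, a in zip(before, after)) / len(before)
        if mean_move > k * spread(before):
            shifted += 1
    return shifted / len(tests)

# Two toy knockout tests on the same four samples (hypothetical embeddings).
before = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
large_move = [[x + 3.0, y] for x, y in before]   # every sample moves by 3.0
small_move = [[x + 0.1, y] for x, y in before]   # every sample moves by 0.1
sr = shift_rate([(before, large_move), (before, small_move)], k=1.0)
sr_strict = shift_rate([(before, large_move), (before, small_move)], k=5.0)
print(sr, sr_strict)
```

Raising $k$ makes the threshold stricter, so fewer knockout tests are counted as shifts.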

Section: I CASE STUDY
I.1 BREAST INVASIVE CARCINOMA
Breast Invasive Carcinoma, commonly called BRCA, holds a significant position in cancer research due to its prevalence and clinical importance (Sharma et al., 2010). BRCA represents the most common form of breast cancer, accounting for a substantial portion of cancer-related morbidity and mortality worldwide. Moreover, it is a heterogeneous disease with diverse molecular subtypes, each with distinct clinical behaviors and therapeutic responses. This molecular complexity and clinical diversity make it an ideal candidate for investigating gene networks and deciphering the intricacies of cancer biology. Therefore, in cancer studies, BRCA serves as a cornerstone. Insights gained from BRCA research have huge implications for cancer biology and precision oncology, extending beyond breast cancer to other malignancies.
-BRCA Subtypes. The BRCA dataset used here contains various molecular subtypes (patient groups), which are identified based on distinct genetic alterations and clinical features. These subtypes include luminal A, luminal B, HER2-enriched, basal-like, and normal-like subtypes, each characterized by specific gene expression patterns and clinical behaviors (Orrantia-Borunda et al., 2022). We give a brief overview of these subtypes:
• Luminal A: This subtype is characterized by the expression of estrogen receptor (ER) and/or progesterone receptor (PR) and low levels of the HER2 protein. Luminal A tumors typically have a favorable prognosis and are often responsive to hormone-based therapies.
• Luminal B: Luminal B tumors also express ER and/or PR but may have higher levels of proliferation markers such as Ki-67 (Sobecki et al., 2017). They can be divided into luminal B HER2-positive (ER/PR-positive, HER2-positive) and luminal B HER2-negative (ER/PR-positive, HER2-negative) subtypes. Luminal B tumors generally have a poorer prognosis compared to luminal A tumors.
• HER2-enriched: HER2-enriched tumors overexpress the HER2 protein without expressing hormone receptors (ER/PR-negative, HER2-positive). They are typically aggressive and associated with a higher risk of recurrence. Targeted therapies directed against HER2, such as trastuzumab (Herceptin), are often effective in treating HER2-enriched tumors.
• Basal-like: Basal-like tumors are characterized by the absence of hormone receptors (ER/PR-negative) and HER2 amplification (HER2-negative). They often display features similar to basal/myoepithelial cells of the mammary gland and are associated with a poor prognosis. Basal-like tumors are frequently referred to as "triple-negative" (Chacón & Costanzo, 2010) breast cancers due to the lack of expression of ER, PR, and HER2.
• Normal-like: Normal-like tumors have gene expression profiles resembling normal breast tissue. They are less common and less well-defined than other subtypes, and their clinical significance is not fully understood.

Section: I.2 EXPERIMENTS AND ANALYSIS PROTOCOLS
We briefly introduce the background and application cases of the experiments and analysis protocols in the biological validation.
-Gene Expression Distribution Analysis. This analysis involves examining the distribution of gene expression levels across different experimental conditions or patient groups to visualize the distribution of expression levels for genes. This analysis has been extensively used in cancer research to explore the expression patterns of key oncogenes and tumor suppressor genes across different cancer types and stages. In studies of cancer patients, researchers may compare the expression distributions of oncogenes and tumor suppressor genes between tumor samples and adjacent normal tissue samples. Differences in expression distributions may indicate dysregulation of these genes in cancer. For instance, gene expression distribution analysis was employed to investigate the expression levels of TP53, a well-known tumor suppressor gene, in various cancer types (Olivier et al., 2010). This analysis revealed significant alterations in the distribution of TP53 expression in different cancer cohorts, showing its potential role as a diagnostic or prognostic marker in malignancies.
-Differential Gene Expression Analysis. Differential gene expression analysis has been a cornerstone of transcriptomic studies. This analysis compares gene expression levels between different experimental conditions or sample groups to identify significantly upregulated or downregulated genes. Statistical tests such as t-tests or non-parametric tests are commonly used. For example, the gene expression profiles of cancer patients and healthy controls can be compared to identify dysregulated genes in cancer. Genes with significant differences in expression levels may be further investigated as potential biomarkers or therapeutic targets. In one study, researchers performed differential gene expression analysis on RNA-seq data from Alzheimer's disease patients and healthy controls (Twine et al., 2011). This analysis identified a panel of differentially expressed genes implicated in neuroinflammation and synaptic dysfunction, revealing molecular pathways associated with Alzheimer's disease progression.

Section: J ABLATION STUDIES
In this section, we describe the ablation studies. We designed ablations and model variants for each module to verify that the core concepts of the proposed method remain effective across a diverse set of deep architectures and training strategies.
First, we ran experiments with various deep generative models for learning sample embeddings in the sample embedding learning module. The experiment comprised the following model variants:
• GeSubNet-VAE: It uses a basic VAE to learn sample embeddings by performing clustering tasks on patient samples.
• GeSubNet-VQVAE: It uses a VQ-VAE to learn sample embeddings by performing clustering tasks on patient samples.
• GeSubNet-GAN: It adds a GAN structure on top of a basic AE. This model performs sample augmentation while performing clustering tasks on patient samples.
Next, for the gene embedding learning module, we ran experiments with various graph neural network models for learning gene embeddings. The experiment included the following model variants:
• GeSubNet-GCN: A variant that uses a GCN to learn gene embeddings through the link prediction task.
• GeSubNet-GAT: A variant that uses a GAT to learn gene embeddings through the link prediction task.
Finally, for the ablation study on the gene network inference module, we included the following model variants:
• GeSubNet-OneStep: A variant that removes the entire module and substitutes it with a one-step model.
• GeSubNet-Conca: Another one-step variant that contains an additional neural layer, which takes the concatenated sample embeddings and gene embeddings as input for network classification tasks.

Section: ACKNOWLEDGMENT
This work was supported by JSPS KAKENHI Grants-in-Aid for Scientific Research Number 24K20778 and 25K03231, JST CREST JPMJCR23M3, NSF award SCH-2205289, SCH-2014438, IIS-2034479, and AIST policy-based budget projects of "Research DX Platforms". This paper is based on results obtained from the project, "Research and Development Project of the Enhanced infrastructures for Post 5G Information and Communication Systems" (JPNP20017), commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
Ablation. We conduct three detailed ablation studies to evaluate the impact of each module in GeSubNet. More details can be found in Appendix J. Figure 7 presents the results of the three ablation studies across all variant models. For Patient-M, the proposed sample encoder significantly outperforms all other DGM models across the four network inference tasks (BRCA, GBM, LGG, and OV). For instance, the proposed method achieves an average improvement of 32.3%/31.2%/22.1%/32.3% in terms of GED. The Graph-M ablations show that the method using Neo-GNN consistently performs best, while the other GNN models yield comparable results. For Infer-M ablation, GeSubNet significantly outperforms the other objective functions, achieving approximately twice the metric values of its counterparts.

Section: K PRIOR GRAPH V.S. NEWLY GENERATED GRAPH
We evaluated patient group learning by feeding the graph newly generated by GeSubNet into a plain GCN and comparing the result with the same GCN initialized with the prior graph. Figure 8 presents a UMAP visualization of the learned latent sample spaces with prior-graph initialization (left) and generated-graph initialization (right). In the left sub-figure, different patient groups appear mixed in the latent sample space derived from the prior gene network. In the right sub-figure, by contrast, the boundaries between patient groups are clearer. These results indicate that the common prior gene networks contain redundant information and that GeSubNet provides more structured information with potential value for cancer studies.


References:
[b0]  (2023). Financial burden of cancer care. 
[b1] Athena Aktipis; Randolph M Nesse (2013). Evolutionary foundations for cancer biology. Evolutionary applications
[b2] Michael Ashburner; Catherine A Ball; Judith A Blake; David Botstein; Heather Butler; J Michael Cherry; Allan P Davis; Kara Dolinski; Selina S Dwight; Janan T Eppig (2000). Gene ontology: tool for the unification of biology. Nature genetics
[b4] Avi Ashkenazi; Wayne J Fairbrother; Joel D Leverson; Andrew J Souers (2017). From basic apoptosis discoveries to advanced selective BCL-2 family inhibitors. Nature reviews drug discovery
[b5] Allan Balmain; Joe Gray; Bruce Ponder (2003). The genetics and genomics of cancer. Nature genetics
[b6] Aviv Bergman; Mark L Siegal (2003). Evolutionary capacitance as a general feature of complex gene networks. Nature
[b7] Alvis Brazma; Jaak Vilo (2000). Gene expression data analysis. FEBS letters
[b8] Reinaldo D Chacón; María V Costanzo (2010). Triple-negative breast cancer. Breast cancer research
[b9] Zheng Chen; Ziwei Yang; Lingwei Zhu; Peng Gao; Takashi Matsubara; Shigehiko Kanaya; Md Altaf-Ul-Amin (2023). Learning vector quantized representation for cancer subtypes identification. Computer Methods and Programs in Biomedicine
[b10] Zheng Chen; Lingwei Zhu; Ziwei Yang; Takashi Matsubara (2023). Automated cancer subtyping via vector quantization mutual information maximization. 
[b11] Lin Cheng; Qiulian Wu; Zhi Huang; Olga A Guryanova; Qian Huang; Weinian Shou; Jeremy N Rich; Shideng Bao (2011). L1CAM regulates DNA damage checkpoint response of glioblastoma stem cells through NBS1. The EMBO journal
[b12] Christine Desmedt; Benjamin Haibe-Kains; Pratyaksha Wirapati; Marc Buyse; Denis Larsimont; Gianluca Bontempi; Mauro Delorenzi; Martine Piccart; Christos Sotiriou (2008). Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clinical cancer research
[b13] Deborah A Forst; Brian V Nahed; Jay S Loeffler; Tracy T Batchelor (2014). Low-grade gliomas. The oncologist
[b14] Xinbo Gao; Bing Xiao; Dacheng Tao; Xuelong Li (2010). A survey of graph edit distance. Pattern Analysis and applications
[b15] Kwang-Il Goh; Michael E Cusick; David Valle; Barton Childs; Marc Vidal; Albert-László Barabási (2007). The human disease network. Proceedings of the National Academy of Sciences
[b17] Nicolas Goossens; Shigeki Nakagawa; Xiaochen Sun; Yujin Hoshida (2015). Cancer biomarker discovery and validation. Translational cancer research
[b18] Robert L Grossman; Allison P Heath; Vincent Ferretti; Harold E Varmus; Douglas R Lowy; Warren A Kibbe; Louis M Staudt (2016). Toward a shared vision for cancer genomic data. New England Journal of Medicine
[b19] Deisy Morselli Gysi; Andre Voigt; Tiago de Miranda Fragoso; Eivind Almaas; Katja Nowick (2018). wTO: an R package for computing weighted topological overlap and a consensus network with integrated visualization tool. BMC bioinformatics
[b20] Da Wei Huang; Brad T Sherman; Richard A Lempicki (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols
[b21] Gordon C Jayson; Elise C Kohn; Henry C Kitchener; Jonathan A Ledermann (2014). Ovarian cancer. The lancet
[b22] Minoru Kanehisa; Susumu Goto (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic acids research
[b23] Minoru Kanehisa; Miho Furumichi; Yoko Sato; Yuriko Matsuura; Mari Ishiguro-Watanabe (2024). KEGG: biological systems database as a model of the real world. Nucleic Acids Research
[b24] Chuanze Kang; Han Zhang; Zhuo Liu; Shenwei Huang; Yanbin Yin (2022). LR-GNN: A graph neural network based on link representation for predicting molecular associations. Briefings in Bioinformatics
[b25] Danai Koutra; Joshua T Vogelstein; Christos Faloutsos (2013). DeltaCon: A principled massive-graph similarity function. SIAM
[b26] Peter Langfelder; Steve Horvath (2008). WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics
[b27] Sean E Lawler; Maria-Carmela Speranza; Choi-Fong Cho; E Antonio Chiocca (2017). Oncolytic viruses in cancer treatment: a review. JAMA oncology
[b28] Jeffrey T Leek; W Evan Johnson; Hilary S Parker; Andrew E Jaffe; John D Storey (2012). The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics
[b29] Menglu Li; Zhiwei Wang; Luotao Liu; Xuan Liu; Wen Zhang (2024). Subgraph-aware graph kernel neural network for link prediction in biological networks. IEEE Journal of Biomedical and Health Informatics
[b30] Peng Liang; Arthur B Pardee (2003). Analysing differential gene expression in cancer. Nature Reviews Cancer
[b31] Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova (2010). Entrez Gene: gene-centered information at NCBI. Nucleic acids research
[b32] Adam A Margolin; Ilya Nemenman; Katia Basso; Chris Wiggins; Gustavo Stolovitzky; Riccardo Dalla Favera; Andrea Califano (2006). ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC bioinformatics
[b33] Jessica N McAlpine; Henry Porter; Martin Köbel; Brad H Nelson; Leah M Prentice; Steve E Kalloger; Janine Senz; Katy Milne; Jiarui Ding; Sohrab P Shah (2012). BRCA1 and BRCA2 mutations correlate with TP53 abnormalities and presence of immune cell infiltrates in ovarian high-grade serous carcinoma. Modern Pathology
[b34] Christopher J Mungall; Julie A McMurry; Sebastian Köhler; James P Balhoff; Charles Borromeo; Matthew Brush; Seth Carbon; Tom Conlin; Nathan Dunn; Mark Engelstad (2017). The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic acids research
[b35] Magali Olivier; Monica Hollstein; Pierre Hainaut (2010). TP53 mutations in human cancers: origins, consequences, and clinical use. Cold Spring Harbor perspectives in biology
[b36] Erasmo Orrantia-Borunda; Patricia Anchondo-Nuñez; Lucero Evelia Acuña-Aguilar; Francisco Octavio Gómez-Valles; Claudia Adriana Ramírez-Valdespino (2022). Subtypes of breast cancer. 
[b37] Huaxin Pang; Shikui Wei; Zhuoran Du; Yufeng Zhao; Shengxing Cai; Yao Zhao (2024). Graph representation learning based on specific subgraphs for biomedical interaction prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics
[b38] Typhaine Paysan-Lafosse; Matthias Blum; Sara Chuguransky; Tiago Grego; Beatriz Lázaro Pinto; Gustavo A Salazar; Maxwell L Bileschi; Peer Bork; Alan Bridge; Lucy Colwell (2023). InterPro in 2022. Nucleic acids research
[b39] Nataša Pržulj (2007). Biological network comparison using graphlet degree distribution. Bioinformatics
[b40] Mark D Robinson; Davis J McCarthy; Gordon K Smyth (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics
[b41] Geoffrey I Shapiro (2006). Cyclin-dependent kinase pathways as targets for cancer treatment. Journal of clinical oncology
[b42] Ganesh N Sharma; Rahul Dave; Jyotsana Sanadya; Piush Sharma; K K Sharma (2010). Various types and management of breast cancer: an overview. Journal of advanced pharmaceutical technology & research
[b43] Michal Sobecki; Karim Mrouj; Jacques Colinge; François Gerbe; Philippe Jay; Liliana Krasinska; Vjekoslav Dulic; Daniel Fisher (2017). Cell-cycle regulation accounts for variability in Ki-67 expression levels. Cancer research
[b44] Alicia T Specht; Jun Li (2017). LEAP: constructing gene co-expression networks for single-cell RNA-sequencing data using pseudotime ordering. Bioinformatics
[b45] Damian Szklarczyk; Andrea Franceschini; Stefan Wyder; Kristoffer Forslund; Davide Heller; Jaime Huerta-Cepas; Milan Simonovic; Alexander Roth; Alberto Santos; Kalliopi P Tsafou (2015). STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic acids research
[b46] Damian Szklarczyk; Rebecca Kirsch; Mikaela Koutrouli; Katerina Nastou; Farrokh Mehryary; Radja Hachilif; Annika L Gable; Tao Fang; Nadezhda T Doncheva; Sampo Pyysalo (2023). The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic acids research
[b47] T Hastie; R Tibshirani; B Narasimhan; G Chu (2022). impute: Imputation for microarray data. R package version
[b48] The Cancer Genome Atlas Research Network (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature
[b49] Natalie A Twine; Karolina Janitz; Marc R Wilkins; Michal Janitz (2011). Whole transcriptome sequencing reveals gene expression and splicing differences in brain regions affected by Alzheimer's disease. PloS one
[b50] Kaja Urbańska; Justyna Sokołowska; Maciej Szmidt; Paweł Sysa (2014). Glioblastoma multiforme-an overview. Contemporary Oncology/Współczesna Onkologia
[b51] Aaron van den Oord; Oriol Vinyals; Koray Kavukcuoglu (2017). Neural discrete representation learning. Advances in neural information processing systems
[b52] Danila Vella; Italo Zoppis; Giancarlo Mauri; Pierluigi Mauri; Dario Di Silvestre (2017). From protein-protein interactions to protein co-expression networks: a new perspective to evaluate large-scale proteomic data. EURASIP Journal on Bioinformatics and Systems Biology
[b53] Eloise Withnell; Xiaoyu Zhang; Kai Sun; Yike Guo (2021). XOmiVAE: an interpretable deep learning model for cancer classification using high-dimensional omics data. Briefings in Bioinformatics
[b54] Qing-Wen Wu; Jun-Feng Xia; Jian-Cheng Ni; Chun-Hou Zheng (2021). GAERF: predicting lncRNA-disease associations by graph auto-encoder and random forest. Briefings in bioinformatics
[b55] Tianzhi Wu; Erqiang Hu; Shuangbin Xu; Meijun Chen; Pingfan Guo; Zehan Dai; Tingze Feng; Lang Zhou; Wenli Tang; Li Zhan (2021). clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The innovation
[b56] Bo Yang; Ting-Ting Xin; Shan-Min Pang; Meng Wang; Yi-Jie Wang (2021). Deep Subspace Mutual Learning for cancer subtypes prediction. Bioinformatics
[b57] Hai Yang; Rui Chen; Dongdong Li; Zhe Wang (2021). Subtype-GAN: a deep learning approach for integrative cancer subtyping of multi-omics data. Bioinformatics
[b58] Ziwei Yang; Zheng Chen; Yasuko Matsubara; Yasushi Sakurai (2023). MoCLIM: Towards accurate cancer subtyping via multi-omics contrastive learning with omics-inference modeling. 
[b59] Seongjun Yun; Seoyoon Kim; Junhyun Lee; Jaewoo Kang; Hyunwoo J Kim (2021). Neo-GNNs: Neighborhood overlap-aware graph neural networks for link prediction. Advances in Neural Information Processing Systems
[b60] Naif Zaman; Lei Li; Maria Luz Jaramillo; Zhanpeng Sun; Chabane Tibiche; Myriam Banville; Catherine Collins; Mark Trifiro; Miltiadis Paliouras; Andre Nantel (2013). Signaling network assessment of mutations and copy number variations predict breast cancer subtype-specific drug targets. Cell reports
[b61] Bin Zhang; Steve Horvath (2005). A general framework for weighted gene co-expression network analysis. Statistical applications in genetics and molecular biology
[b62] Lin Zhang; Wei Zhou; Victor E Velculescu; Scott E Kern; Ralph H Hruban; Stanley R Hamilton; Bert Vogelstein; Kenneth W Kinzler (1997). Gene expression profiles in normal and cancer cells. Science
[b63] Chengshuai Zhao; Shuai Liu; Feng Huang; Shichao Liu; Wen Zhang (2021). CSGNN: Contrastive self-supervised graph neural network for molecular interaction prediction. 

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: An example illustrating the mismatch issue in cancer gene networks. The BRCA gene network from the STRING database shows general interactions across various subtypes. Although a gene set with consistent behavior leads to the discovery of a sub-network, this sub-network cannot be directly linked to specific subtypes, such as Luminal A, Luminal B, or Basal-like.
Data: 

Figure fig_1: 2
Type: figure
Caption: Figure 2: Overview. Step 1: Patient-M sets up an unsupervised learning task to generate the patient sample representation ($Z_p$) from the input gene expression data ($X$), which can distinguish subtypes. Step 2: Graph-M sets up a link prediction task to train the GNN encoder and decoder, learning the graph representation ($Z_g$) from the input gene graph ($G$) and expression data ($X$). Step 3: Infer-M uses an objective function that integrates the representations to generate subtype-specific networks. The reconstruction from Patient-M, conditioned on the GNN training in Graph-M ($q_\theta(z_g|G)$), refines the graph structure while ensuring accurate patient profile reconstruction ($p_\phi(x|x)$).
Data: 

Figure fig_2: 3
Type: figure
Caption: Figure 3: The Venn diagrams illustrate the overlap in GO terms resulting from different methods (WGCNA, CSGNN, LR-GNN, and GeSubNet) across four cancers. Shared and unique function items are listed here. A full list is provided in Appendix G. We highlight some unique function items that are well-supported by biological evidence in bold.
Data: 

Figure fig_3: 4
Type: figure
Caption: Figure 4: (a) UMAP visualization of an example showing patient distribution before and after the simulated gene knockout for a target subtype. The gray points in the main figure represent the negative control groups (subtypes). The small figures at the bottom left represent the original distributions of different subtypes. In the right subfigure, high-ranking genes are knocked out, while in the left, low-ranking genes are knocked out. (b) Table: shift rates ($\Delta_{SR}$) on knocking out high- and low-ranking genes, found by different baselines. The best results are highlighted in bold.
Data: 

Figure fig_4: 56
Type: figure
Caption: Figure 5: The obtained gene networks for two BRCA patient groups (subtypes).
Data: 

Figure fig_5: 
Type: figure
Caption: norm_factors <- calcNormFactors(filtered_data); normalized_data <- cpm(filtered_data, log=FALSE, normalized.lib.sizes=TRUE). (4) Batch Effect Correction: To minimize batch effects arising from non-uniform experimental protocols, the 'ComBat' function from the sva package (Leek et al., 2012) was applied to remove unwanted variation across different platforms and projects: corrected_data <- ComBat(dat=normalized_data, batch=batch_info).
Data: 
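The normalization step in the preprocessing snippet above can be illustrated with a small sketch of the counts-per-million (CPM) arithmetic. This is a plain-Python illustration, not the edgeR implementation: the function name, the toy counts, and the default normalization factors of 1.0 are assumptions, and the batch correction performed by ComBat is not reproduced here.

```python
def counts_per_million(counts, norm_factors=None):
    """Scale raw counts to counts per million.

    counts: list of samples, each a list of per-gene raw counts.
    norm_factors: optional per-sample scaling factors; the effective
    library size is the raw library size times the factor.
    """
    if norm_factors is None:
        norm_factors = [1.0] * len(counts)
    cpm = []
    for sample, factor in zip(counts, norm_factors):
        effective_lib_size = sum(sample) * factor
        cpm.append([c / effective_lib_size * 1e6 for c in sample])
    return cpm


# Hypothetical toy data: two samples, three genes.
raw_counts = [[10, 30, 60], [5, 5, 90]]
cpm_values = counts_per_million(raw_counts)
```

With all factors equal to 1.0, each sample's CPM values sum to one million, which makes samples with different sequencing depths comparable.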

Figure tab_1: 1
Type: table
Caption: Summary of gene expression profile data and gene network data for four cancer types.
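Data:
| Cancer | Gene Expression Matrix: #Subtypes | Gene Expression Matrix: #Features | Gene Expression Matrix: #Patients | Gene Network: #Nodes | Gene Network: #Edges | Knowledge Databases (KE, ST, Int, Mona) |
| BRCA | 5 | 11327 | 638 | 146 | 868 | ✓✓✓✓ |
| GBM | 5 | 11273 | 416 | 102 | 203 | ✓✓✓ |
| LGG | 3 | 11124 | 451 | 103 | 345 | ✓✓✓ |
| OV | 4 | 11324 | 291 | 109 | 159 | ✓✓ |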

Figure tab_2: 2
Type: table
Caption: Baseline comparison results on GED, DCS, and CDV for the proposed method and the baselines. GED, DCS, and CDV are subjected to min-max normalization. The best-performing results are highlighted in bold. The second-best results are underlined.
Data: 

Figure tab_3: 3
Type: table
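Caption: The comparison results on #EBF between GeSubNet and the baselines. Only biological functions with high statistical significance (p-value < 0.05) are reported.
Data:
| Method | #EBF (↑) BRCA | #EBF (↑) GBM | #EBF (↑) LGG | #EBF (↑) OV |
| WGCNA | 5 | 3 | 2 | 2 |
| wTO | 4 | 4 | 2 | 2 |
| ARACNe | 4 | 4 | 1 | 2 |
| LEAP | 3 | 3 | 2 | 3 |
| GAERF | 5 | 3 | 3 | 2 |
| LR-GNN | 6 | 4 | 3 | 3 |
| CSGNN | 3 | 5 | 4 | 4 |
| GeSubNet | 8 | 6 | 6 | 5 |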

Figure tab_4: :
Type: table
Caption: Shift rates ($\Delta_{SR}$) on knocking out high- and low-ranking genes across different methods
Data: 

Figure tab_5: 4
Type: table
Caption: The mathematical notations and explanations used in the paper are summarized in Table 4. Mathematical notations and explanations.
Data: 

Figure tab_6: 
Type: table
Caption: Table 6 describes the details of all experimental cancer gene network datasets. KEGG, STRING, InterPro, and Monarch are abbreviated as KE, ST, Int, and Mona, respectively.
Data: 

Figure tab_7: 5
Type: table
Caption: Descriptions of four cancer gene expression datasets.
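Data:
| Cancer | Raw Transcriptomics: #Gene | Raw Transcriptomics: #Patient | Raw Transcriptomics: #Group | Gene Expression Matrix: Sample size | Gene Expression Matrix: Feature size |
| BRCA | 19537 | 638 | 5 | {320, 124, 119, 54, 21} | 11327 |
| GBM | 17455 | 416 | 5 | {125, 111, 80, 68, 32} | 11273 |
| LGG | 16245 | 451 | 3 | {213, 151, 87} | 11124 |
| OV | 17226 | 291 | 4 | {81, 76, 68, 66} | 11324 |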

Figure tab_8: 6
Type: table
Caption: Descriptions of the four cancer gene network datasets.
Data:
| Cancer | Data Source (KE, ST, Int, Mona) | #Node | #Edge (Type I) | #Edge (Type II) |
| BRCA | ✓✓✓✓ | 146 | 289 | 579 |
| GBM | ✓✓✓ | 102 | 75 | 128 |
| LGG | ✓✓✓ | 103 | 206 | 139 |
| OV | ✓✓ | 109 | 46 | 95 |

Figure tab_9: 7
Type: table
Caption: Hyperparameter sensitivity experiment. The best-performing results are highlighted in bold, and the checkmark indicates our choice of the optimal settings.
Data:
| Hyperparameter | GED (↑) | DCS (↓) | CDV (↑) |
| Latent Dim = 16 | 0.76 | 0.49 | 0.74 |
| Latent Dim = 32 (✓) | 0.78 | 0.47 | 0.75 |
| Latent Dim = 64 | 0.79 | 0.48 | 0.73 |
| #Code Book = 16 | 0.72 | 0.51 | 0.68 |
| #Code Book = 32 (✓) | 0.78 | 0.47 | 0.75 |
| #Code Book = 64 | 0.75 | 0.54 | 0.63 |
| Batch Size = 16 | 0.76 | 0.48 | 0.73 |
| Batch Size = 32 (✓) | 0.78 | 0.47 | 0.75 |
| Batch Size = 64 | 0.77 | 0.49 | 0.68 |

Figure tab_10: 8
Type: table
Caption: The computational requirements of the proposed method, including both runtime (training and inference time) and memory usage, across different cancer datasets.
Data:
| Dataset | Training Time (sec) | Inference Time (sec) | GPU Memory Usage (MB) |
| BRCA | 6.23 ± 0.03 | 2.21 ± 0.05 | 4393 ± 152 |
| GBM | 3.64 ± 0.04 | 1.44 ± 0.07 | 2764 ± 117 |
| LGG | 5.96 ± 0.03 | 1.98 ± 0.04 | 3834 ± 122 |
| OV | 3.45 ± 0.04 | 1.25 ± 0.04 | 2583 ± 108 |
| Pan-cancer | 454.47 ± 12.21 | 42.86 ± 5.21 | 5893 ± 733 |

Figure tab_11: 9
Type: table
Caption: Detailed enriched GO terms across four cancer tasks resulting from different methods. ... signaling in GBM, and the Notch signaling pathway in LGG. They are absent from the other methods' results, yet evidence has proven their relevance to cancers.
Data:
| Cancer | Method | Enriched GO terms |
| BRCA | WGCNA | Cell cycle arrest, DNA repair, Apoptotic signaling pathway, Regulation of cell migration, Wnt signaling pathway |
| BRCA | wTO | DNA repair, Cell cycle arrest, Apoptotic signaling pathway, Regulation of cell migration |
| BRCA | ARACNe | DNA repair, Cell cycle arrest, Apoptotic signaling pathway, Regulation of cell migration |
| BRCA | LEAP | Cell cycle arrest, DNA repair, Wnt signaling pathway |
| BRCA | GAERF | DNA repair, Cell cycle arrest, Wnt signaling pathway, Apoptotic signaling pathway, Regulation of cell migration |
| BRCA | LR-GNN | DNA repair, Cell cycle arrest, Wnt signaling pathway, Apoptotic signaling pathway, Regulation of cell migration, DNA damage response |
| BRCA | CSGNN | DNA repair, Cell cycle arrest, Apoptotic signaling pathway |
| BRCA | Proposed | DNA repair, Cell cycle arrest, Apoptotic signaling pathway, Wnt signaling pathway, Regulation of cell migration, Immune diseases, Tumor necrosis factor signaling, Cell proliferation |
| GBM | WGCNA | Cell cycle arrest, DNA damage response, Apoptotic signaling pathway |
| GBM | wTO | DNA damage response, Apoptotic signaling pathway, Tumor necrosis factor signaling, Wnt signaling pathway |
| GBM | ARACNe | DNA damage response, Tumor necrosis factor signaling, Wnt signaling pathway, Cell proliferation |
| GBM | LEAP | DNA damage response, Apoptotic signaling pathway, Wnt signaling pathway |
| GBM | GAERF | DNA damage response, Wnt signaling pathway, Cell proliferation |
| GBM | LR-GNN | Wnt signaling pathway, DNA damage response, Apoptotic signaling pathway, Cell proliferation |
| GBM | CSGNN | DNA damage response, Apoptotic signaling pathway, Wnt signaling pathway, Cell proliferation, Tumor necrosis factor signaling |
| GBM | Proposed | DNA damage response, Apoptotic signaling pathway, Wnt signaling pathway, Tumor necrosis factor signaling, Cell proliferation, DNA damage checkpoint signaling |
| LGG | WGCNA | Wnt signaling pathway, Regulation of cell migration |
| LGG | wTO | Wnt signaling pathway, DNA repair |
| LGG | ARACNe | Wnt signaling pathway |
| LGG | LEAP | Regulation of cell migration, Wnt signaling pathway |
| LGG | GAERF | Cell cycle arrest, Wnt signaling pathway, Regulation of cell migration |
| LGG | LR-GNN | Cell cycle arrest, Wnt signaling pathway, DNA repair |
| LGG | CSGNN | Cell cycle arrest, Apoptotic signaling pathway, Wnt signaling pathway, DNA repair |
| LGG | Proposed | Cell cycle arrest, Apoptotic signaling pathway, Wnt signaling pathway, Notch signaling pathway, Tumor necrosis factor signaling, Cell proliferation |
| OV | WGCNA | DNA repair, Apoptotic signaling pathway |
| OV | wTO | Regulation of cell migration, DNA repair |
| OV | ARACNe | DNA repair, Apoptotic signaling pathway |
| OV | LEAP | DNA repair, Apoptotic signaling pathway, Cell proliferation |
| OV | GAERF | DNA repair, Apoptotic signaling pathway |
| OV | LR-GNN | DNA repair, Apoptotic signaling pathway, Cell proliferation |
| OV | CSGNN | DNA repair, Apoptotic signaling pathway, Wnt signaling pathway, Cell proliferation |
| OV | Proposed | DNA repair, Apoptotic signaling pathway, Wnt signaling pathway, Tumor necrosis factor signaling, Cell proliferation |


Formulas:
Formula formula_0: $x^{(m)} = \{x^{(m)}_1, x^{(m)}_2, \cdots, x^{(m)}_N\}$. Let $Y = \{y_1, y_2, \cdots, y_{|Y|}\}$

Formula formula_1: $X' = \mathrm{MLP}_{node}\left(\sum_{i=1}^{j} \mathrm{MLP}_{edge}(A_{ij}),\, X\right)$

Formula formula_2: $\mathcal{L}(\phi; x) := -\mathbb{E}_{q_\phi(z_e|x)}\left[\log p_\phi(x|z_q)\right] \quad (1)$

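Formula formula_3: $\mathcal{L}(\theta; \omega) := -\frac{1}{E}\sum_{i=1}^{E}\left[h_e \log\big(\hat{h}_e(\omega; z_e(\theta))\big) + (1 - h_e)\log\big(1 - \hat{h}_e(\omega; z_e(\theta))\big)\right] \quad (2)$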
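The link prediction objective in Eq. (2) has the form of a binary cross-entropy averaged over candidate edges. The sketch below evaluates that quantity for given labels and predicted probabilities; the function name, the toy values, and the clipping constant `eps` are assumptions added for numerical safety and are not taken from the paper.

```python
import math


def link_prediction_bce(edge_labels, edge_probs, eps=1e-12):
    """Mean binary cross-entropy over candidate edges.

    edge_labels: ground-truth edge indicators h_e in {0, 1}.
    edge_probs: predicted edge probabilities in [0, 1].
    eps: clipping constant (an assumption) to keep log() finite.
    """
    total = 0.0
    for h, h_hat in zip(edge_labels, edge_probs):
        h_hat = min(max(h_hat, eps), 1.0 - eps)
        total += h * math.log(h_hat) + (1 - h) * math.log(1.0 - h_hat)
    return -total / len(edge_labels)


# Hypothetical toy example: three candidate edges.
loss = link_prediction_bce([1, 0, 1], [0.9, 0.2, 0.6])
```

The loss approaches zero as the predicted probabilities approach the true edge indicators.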

Formula formula_5: $X = Z_p \cdot Z_g^{T}$.
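The product of the sample and gene representations can be sketched as follows. The shapes are assumptions made for illustration: $Z_p$ is taken as (patients × latent dimensions) and $Z_g$ as (genes × latent dimensions), so the product is a (patients × genes) matrix; the toy values are hypothetical.

```python
def product_with_transpose(z_p, z_g):
    """Return Z_p @ Z_g^T for matrices stored as nested lists (row-major)."""
    return [
        [sum(a * b for a, b in zip(p_row, g_row)) for g_row in z_g]
        for p_row in z_p
    ]


# Hypothetical toy embeddings: 2 patients and 3 genes in a 2-dimensional latent space.
z_p = [[1.0, 0.0], [0.5, 0.5]]
z_g = [[2.0, 1.0], [0.0, 4.0], [1.0, 1.0]]

x_rec = product_with_transpose(z_p, z_g)  # shape: 2 patients x 3 genes
```

Each entry is the inner product of one patient embedding with one gene embedding, so both representations must share the same latent dimension.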

Formula formula_6: $\mathcal{L}(\theta; x) = -\mathbb{E}_{q_\theta(z_g|G)}\left[\log p_\phi(x|x)\right] \quad (3)$

Formula formula_7: (Table 2 column headers) Method; for each of BRCA, GBM, LGG, and OV: CDV (↑), GED (↑), DCS (↓)

Formula formula_8: $\Delta_{SR} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\left(\|x_i^{before} - x_i^{after}\| > k \cdot \sigma_t\right) \quad (4)$
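A minimal sketch of the shift rate in Eq. (4) is given below. It assumes that the outer index runs over trials, that each trial has its own scale $\sigma_t$, and that the norm is Euclidean; the data layout, function name, and toy coordinates are hypothetical.

```python
import math


def shift_rate(before, after, sigmas, k=1.0):
    """Average fraction of patients whose displacement exceeds k * sigma.

    before, after: one entry per trial, each a list of patient coordinate tuples.
    sigmas: one threshold scale per trial.
    """
    per_trial = []
    for x_before, x_after, sigma in zip(before, after, sigmas):
        shifted = sum(
            1 for b, a in zip(x_before, x_after) if math.dist(b, a) > k * sigma
        )
        per_trial.append(shifted / len(x_before))
    return sum(per_trial) / len(per_trial)


# Hypothetical toy data: two trials in a 2-D embedding space.
before = [[(0, 0), (0, 0), (0, 0), (0, 0)], [(1, 1), (2, 2)]]
after = [[(0, 0), (3, 4), (0.5, 0), (0, 3)], [(4, 5), (2, 7)]]

rate = shift_rate(before, after, sigmas=[1.0, 2.0], k=2.0)
```

In this toy case, two of four patients move beyond the threshold in the first trial and both patients do in the second, so the average is 0.75.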

Formula formula_9: $\mathrm{GED}(N_1, N_2) = \min_{\pi} \sum_{(u,v)\in\pi} c(u, v)$.

Formula formula_10: $\mathrm{GED}(N_1, N_2, \ldots, N_n) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1, j\neq i}^{n} \mathrm{GED}(N_i, N_j)$.
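The pairwise averaging above can be sketched as follows. Computing an exact graph edit distance is expensive in general, so this sketch substitutes a simplified cost: it assumes all networks share the same labeled gene set and charges one unit per edge insertion or deletion, which reduces the distance to the size of the symmetric difference of the edge sets. The function names and toy networks are hypothetical.

```python
from itertools import permutations


def undirected_edge_set(edges):
    """Normalize (u, v) pairs so that (u, v) and (v, u) are the same edge."""
    return {frozenset(edge) for edge in edges}


def edge_edit_distance(edges_a, edges_b):
    """Unit-cost edge insertions/deletions between two networks on the same nodes."""
    return len(undirected_edge_set(edges_a) ^ undirected_edge_set(edges_b))


def mean_pairwise_ged(networks):
    """Average the distance over all ordered pairs (i, j) with i != j."""
    n = len(networks)
    total = sum(edge_edit_distance(a, b) for a, b in permutations(networks, 2))
    return total / (n * (n - 1))


# Hypothetical subtype networks over genes g1..g4.
net_a = [("g1", "g2"), ("g2", "g3")]
net_b = [("g2", "g1"), ("g3", "g4")]
net_c = [("g1", "g2")]

avg_ged = mean_pairwise_ged([net_a, net_b, net_c])
```

A larger average indicates that the subtype networks differ more from one another, which is why the tables mark GED with an upward arrow.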

Formula formula_11: $\mathrm{DCS}(G_1, G_2) = 1 - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{1}{N} \sum_{i=1}^{N} \left(I_{G_1}(i, j) - I_{G_2}(i, j)\right)$

Formula formula_12: $\mathrm{CDV} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}(k_i - \bar{k})^2}}{\bar{k}\sqrt{N}} \times \frac{1}{\bar{k}}$.

Formula formula_13: $\#\mathrm{EBF} = \left|\left(GO(G_1) \setminus GO(G_2)\right) \cup \left(GO(G_2) \setminus GO(G_1)\right)\right|$
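The set expression above is a symmetric difference of two GO term sets, and #EBF can be read as its size. The sketch below counts these terms with Python sets; the two example term sets are hypothetical and only reuse term names that appear in the enrichment table.

```python
def num_ebf(go_terms_a, go_terms_b):
    """Count GO terms enriched in exactly one of the two networks."""
    return len((go_terms_a - go_terms_b) | (go_terms_b - go_terms_a))


# Hypothetical enriched GO terms for two subtype networks.
go_subtype_1 = {"DNA repair", "Cell cycle arrest", "Wnt signaling pathway"}
go_subtype_2 = {"DNA repair", "Apoptotic signaling pathway"}

ebf = num_ebf(go_subtype_1, go_subtype_2)
```

Terms shared by both networks, such as "DNA repair" here, do not contribute, so the count reflects only subtype-distinguishing functions.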

Formula formula_14: $\Delta_{SR} = \frac{1}{m}\sum_{j=1}^{m}\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\left(\|x_i^{before} - x_i^{after}\| > k \cdot \sigma_j\right)$
