['1c1', '< Title: GESUBNET: GENE INTERACTION INFERENCE FOR DISEASE SUBTYPE NETWORK GENERATION', '---', '> Title: GeSubNet: Gene Interaction Inference for Disease Subtype Network Generation', '3c3', '< Abstract: Retrieving gene functional networks from knowledge databases presents a challenge due to the mismatch between disease networks and subtype-specific variations. Current solutions, including statistical and deep learning methods, often fail to effectively integrate gene interaction knowledge from databases or explicitly learn subtype-specific interactions. To address this mismatch, we propose GeSubNet, which learns a unified representation capable of predicting gene interactions while distinguishing between different disease subtypes. Graphs generated by such representations can be considered subtype-specific networks. GeSubNet is a multi-step representation learning framework with three modules: First, a deep generative model learns distinct disease subtypes from patient gene expression profiles. Second, a graph neural network captures representations of prior gene networks from knowledge databases, ensuring accurate physical gene interactions. Finally, we integrate these two representations using an inference loss that leverages graph generation capabilities, conditioned on the patient separation loss, to refine subtype-specific information in the learned representation. GeSubNet consistently outperforms traditional methods, with average improvements of 30.6%, 21.0%, 20.1%, and 56.6% across four graph evaluation metrics, averaged over four cancer datasets. Particularly, we conduct a biological simulation experiment to assess how the behavior of selected genes from over 11,000 candidates affects subtypes or patient distributions. 
The results show that the generated network has the potential to identify subtype-specific genes with an 83% likelihood of impacting patient distribution shifts.', '---', '> Abstract: Understanding gene functional networks is fundamental to biomedical research, relying heavily on both comprehensive biological knowledge bases like STRING (Szklarczyk et al., 2023) and KEGG (Kanehisa et al., 2024), and rich experimental data such as patient gene expression profiles. A critical challenge arises from the inherent generalization of these knowledge bases, which often lack the specificity required to capture variations across distinct disease subtypes. Current methods struggle to effectively integrate gene interaction knowledge with subtype-specific variations, leading to misinterpretations of gene behaviors across different disease contexts.', '4a5,8', '> To bridge this critical gap, we introduce GeSubNet, a novel multi-step representation learning framework. GeSubNet learns a unified representation that accurately predicts gene interactions while explicitly distinguishing between different disease subtypes, thereby generating highly targeted subtype-specific networks. It achieves this through three integrated modules: a deep generative model for patient subtyping, a graph neural network for learning prior gene interactions, and a novel inference mechanism that unifies these representations to generate refined, subtype-specific networks.', '> ', '> GeSubNet consistently outperforms traditional and deep learning methods, demonstrating average improvements of 30.6%, 21.0%, 20.1%, and 56.6% across four graph evaluation metrics, averaged over four diverse cancer datasets. Furthermore, through a biological simulation experiment involving over 11,000 gene candidates, we show that the generated networks have the potential to identify subtype-specific genes with an 83% likelihood of impacting patient distribution shifts. 
This work significantly advances the generation of biologically meaningful, subtype-specific gene networks, offering new avenues for precision medicine.', '> ', '6,17c10,21', '< Biological knowledge bases such as STRING (Szklarczyk et al., 2023) and KEGG (Kanehisa et al., 2024), and wet-lab experimental datasets such as gene expression data are crucial for understanding disease-gene association. While the knowledge bases are comprehensive, they often lack specificity for disease subtypes. This work introduces a deep learning method to integrate general knowledge bases with disease-subtype-specific experimental data to create more targeted knowledge graphs.', '< Decades of research have generated extensive disease-gene association data, compiled into various biological knowledge databases (Goh et al., 2007b;Szklarczyk et al., 2023;Kanehisa & Goto, 2000). These databases integrate known and predicted gene interactions, forming gene functional networks that describe how gene behaviors relate to disease processes. They support disease research by interpreting experimental results (Vella et al., 2017), facilitating biomarker discovery (Yang et al., 2023), and enabling personalized treatment (Goossens et al., 2015). Besides general knowledge bases, there are also in-lab experimental data, such as patient gene expression profiles. These experiments filter candidate genes, and the interactions in databases supported by these candidates are considered more relevant to subtypes. However, a mismatch exists between generic knowledge bases and experimental data when studying disease subtypes. For instance, as shown in Figure 1, breast cancer comprises multiple subtypes (luminal A, luminal B, and Basal-like), but databases like STRING provide only a general gene network for all subtypes. 
This generalization can lead to misinterpretations of gene behaviors across subtypes.', '< While bio-researchers have proposed data generation approaches to construct meaningful subtypespecific networks (Zaman et al., 2013), they often require extensive in-lab analyses such as pair-wise gene examination among hundreds to thousands of gene candidates. This paper introduces a novel data-driven approach to address this mismatch, automating the integration of gene expression data and knowledge databases to directly generate gene functional networks for various disease subtypes.', '< Related Work. Existing methods for generating subtype gene networks can be categorized into two groups: statistical and deep learning-based methods. Statistical methods focus on speeding up gene filtering by mining experimental data. These methods employ similarity metrics to measure the correlation between genes. High correlations, such as co-expressed genes (Zhang & Horvath, 2005), are marked as functional interactions. For example, ARACNe (Margolin et al., 2006) uses mutual information to measure expression similarity and removes indirect links with low similarity. WGCNA (Langfelder & Horvath, 2008) calculates Pearson correlation to support large-scale comparisons, while wTO (Gysi et al., 2018) transforms the correlations into probabilistic measures. However, gene interaction retrieval still prioritizes genes of interest.', '< A few deep learning methods leverage both knowledge databases and experimental datasets. They form disease networks as graphs and embed gene expression data, containing different patient information, as node embeddings. They set up link prediction and reconstruction using graph neural networks (GNNs). The newly reconstructed graphs can be viewed as specific networks. Representative methods include GAERF (Wu et al., 2021a), which learns node features with a graph auto-encoder and then uses a random forest to predict links. 
CSGNN (Zhao et al., 2021) predicts gene interactions using both a mix-hop aggregator and a self-supervised GNN. LR-GNN (Kang et al., 2022) proposes a dynamic graph method to gradually reconstruct graph structure, mitigating the constraints of prior general disease network information. Recent works focus on improving the accuracy of gene-gene link prediction (Li et al., 2024;Pang et al., 2024). However, their objective is only to reconstruct general disease-gene associations, including irrelevant interactions. This approach does not explicitly learn the distinct gene interactions unique to disease subtypes.', '< Contributions and Novelty. We present a new solution for leveraging distinct subtype information from experimental data, i.e., gene expression profiles, to directly infer Gene interactions specific to disease Subtype Networks. This leads us to GeSubNet, which learns a unified representation that can accurately predict prior gene interactions while being able to distinguish different subtypes of a disease. Graphs generated by such representations can be considered subtype-specific networks.', '< GeSubNet is a multi-step learning framework with independent data representation learning and integration. The first step uses a deep generative model to learn gene expression representations. These representations capture distinct data distributions and can distinguish subtypes in a latent feature space. The second step employs a GNN to learn graph representations of prior gene networks. This step ensures GeSubNet captures true gene-gene functional interactions collected in knowledge databases. Finally, we integrate the two representations, updating graph representations and inferring subtype-specific gene interactions using a reconstruction loss on the gene expression data.', '< Our experiments confirm that GeSubNet can simultaneously generate different subtype networks within a general cancer. The contributions lie in:', '< • Formulating New Gene Problem. 
We first frame this problem as how to infer gene interactions can help models distinguish subtypes in experimental datasets. We investigate a method that automates the integration of gene expression data and knowledge databases, explicitly generating disease subtype networks.', '< • Proposing Automated Data Integration Methodology. GeSubNet is an effective architecture that combines a VQ-VAE and Neo-GNN, achieving average improvements of 30.6%, 21.0%, 20.1%, and 56.6% across three metrics on four cancer datasets. More advanced models can be easily integrated into GeSubNet.', '< • Impacting Broad Biological Relevance. We propose impactful biological evaluations and a new metric. The experiments involving 11,327 gene evaluations demonstrate that genes selected by GeSubNet are highly related to specific subtypes. We are the first to conduct a simulated experiment, termed Knock-out (Bergman & Siegal, 2003), to assess how the behavior of genes affects different subtypes. The proposed metric evaluates the reliability of selected gene interactions. The results show that GeSubNet effectively narrows down key genes.', '< • Integrated Datasets for Cancer Subtyping. We collect physical cancer-gene networks across four knowledge databases and construct machine-learning-ready datasets for experiments and evaluation. We release our datasets with this paper to support continued investigation. The code and data resources are available at: https://github.com/chenzRG/GeSubNet', '---', '> Understanding disease-gene associations is fundamental to biomedical research, relying heavily on both comprehensive biological knowledge bases like STRING (Szklarczyk et al., 2023) and KEGG (Kanehisa et al., 2024), and rich experimental data such as patient gene expression profiles. A critical challenge arises from the inherent generalization of these knowledge bases, which often lack the specificity required to capture variations across distinct disease subtypes. 
This paper addresses this gap by introducing a novel deep learning approach that effectively integrates generalized knowledge with subtype-specific experimental data to construct highly targeted and biologically meaningful knowledge graphs.', '> Decades of research have generated extensive disease-gene association data, compiled into various biological knowledge databases (Goh et al., 2007b;Szklarczyk et al., 2023;Kanehisa & Goto, 2000). These databases integrate known and predicted gene interactions, forming gene functional networks that describe how gene behaviors relate to disease processes. They support disease research by interpreting experimental results (Vella et al., 2017), facilitating biomarker discovery (Yang et al., 2023), and enabling personalized treatment (Goossens et al., 2015). Alongside these general knowledge bases, in-lab experimental data, such as patient gene expression profiles, offer crucial insights by filtering candidate genes whose interactions are more relevant to specific disease subtypes. However, a significant mismatch persists between the broad scope of generic knowledge bases and the granular detail of experimental data when studying disease subtypes. For instance, as illustrated in Figure 1, breast cancer encompasses multiple subtypes (e.g., Luminal A, Luminal B, and Basal-like), yet databases like STRING typically provide only a general gene network applicable to all subtypes. This generalization can lead to misinterpretations of gene behaviors and hinder the development of targeted therapies.', '> While bio-researchers have proposed data generation approaches to construct meaningful subtype-specific networks (Zaman et al., 2013), these often necessitate extensive in-lab analyses, such as laborious pair-wise gene examinations among hundreds to thousands of gene candidates. 
This paper introduces GeSubNet, a novel data-driven approach designed to automate the integration of gene expression data and knowledge databases, directly generating gene functional networks tailored for various disease subtypes. This automation significantly reduces the reliance on manual, labor-intensive experimental validation.', '> Related Work. Existing methods for generating subtype gene networks can be broadly categorized into two main groups: statistical and deep learning-based methods. Statistical methods primarily focus on accelerating gene filtering by mining experimental data. These approaches typically employ similarity metrics to quantify correlations between genes. High correlations, such as those observed in co-expressed genes (Zhang & Horvath, 2005), are often interpreted as functional interactions. For example, ARACNe (Margolin et al., 2006) utilizes mutual information to measure expression similarity and subsequently removes indirect links with low similarity. WGCNA (Langfelder & Horvath, 2008) calculates Pearson correlation to facilitate large-scale comparisons, while wTO (Gysi et al., 2018) transforms these correlations into probabilistic measures. Despite their utility, these statistical methods often prioritize genes of interest and may not fully capture the complex, multifaceted nature of gene interactions in a subtype-specific manner.', '> A growing number of deep learning methods leverage both knowledge databases and experimental datasets. These methods often represent disease networks as graphs and embed gene expression data, which contains diverse patient information, as node embeddings. They typically employ graph neural networks (GNNs) for tasks such as link prediction and graph reconstruction, where the newly reconstructed graphs are intended to represent specific disease networks. 
Representative methods include GAERF (Wu et al., 2021a), which learns node features using a graph auto-encoder and then employs a random forest for link prediction. CSGNN (Zhao et al., 2021) predicts gene interactions by combining a mix-hop aggregator with a self-supervised GNN. LR-GNN (Kang et al., 2022) proposes a dynamic graph method to gradually reconstruct graph structure, aiming to mitigate the constraints imposed by prior general disease network information. Recent works have also focused on improving the accuracy of gene-gene link prediction (Li et al., 2024;Pang et al., 2024). However, a common limitation of these approaches is their primary objective: to reconstruct general disease-gene associations, which often includes irrelevant interactions. Consequently, these methods do not explicitly learn or highlight the distinct gene interactions unique to specific disease subtypes, which is crucial for precision medicine.', '> Contributions and Novelty. We present GeSubNet, a novel solution specifically designed to leverage distinct subtype information from experimental data, i.e., gene expression profiles, to directly infer gene interactions specific to disease subtype networks. GeSubNet learns a unified representation that can accurately predict prior gene interactions while simultaneously distinguishing between different subtypes of a disease. The graphs generated by such representations can therefore be considered subtype-specific networks.', '> GeSubNet operates as a multi-step learning framework, featuring independent data representation learning and a subsequent integration step. The first step involves a deep generative model to learn gene expression representations that capture distinct data distributions and effectively distinguish subtypes within a latent feature space. 
The second step employs a GNN to learn robust graph representations of prior gene networks, ensuring GeSubNet captures biologically accurate gene-gene functional interactions documented in knowledge databases. Finally, we integrate these two representations through a novel inference module, which updates graph representations and infers subtype-specific gene interactions using a reconstruction loss conditioned on the gene expression data.', '> Our extensive experiments confirm that GeSubNet can simultaneously generate highly differentiated subtype networks within a general cancer context. The key contributions of this paper are:', '> • Formulating a Novel Problem. We formally frame the problem of inferring gene interactions in a way that directly helps models distinguish subtypes in experimental datasets. This work introduces an automated method for integrating gene expression data and knowledge databases, explicitly generating disease subtype networks that are tailored to specific patient groups.', '> • Proposing an Automated Data Integration Methodology. GeSubNet offers an effective and innovative architecture that combines a Vector Quantized-Variational AutoEncoder (VQ-VAE) and Neo-GNN. This integration achieves significant average improvements of 30.6%, 21.0%, 20.1%, and 56.6% across four graph evaluation metrics on four diverse cancer datasets. Furthermore, the modular design of GeSubNet allows for easy integration of more advanced models in the future.', '> • Demonstrating Broad Biological Relevance and Novel Evaluation. We propose impactful biological evaluations, including a new metric, to rigorously assess the generated networks. Experiments involving 11,327 gene evaluations robustly demonstrate that genes selected by GeSubNet are highly related to specific subtypes. We are the first to conduct a simulated experiment, termed "Knock-out" (Bergman & Siegal, 2003), to assess how the behavior of selected genes affects different subtypes. 
The proposed Shift Rate (∆SR) metric effectively evaluates the reliability of selected gene interactions, showing that GeSubNet significantly narrows down key genes with high biological significance.', '> • Integrated and Publicly Available Datasets for Cancer Subtyping. We have meticulously collected physical cancer-gene networks across four comprehensive knowledge databases and constructed machine-learning-ready datasets for both experimental validation and future research. We are releasing our datasets with this paper to support continued investigation and foster advancements in cancer subtyping. The code and data resources are publicly available at: https://github.com/chenzRG/GeSubNet', '403d406', '< ']
