Towards a linearly organized embedding space of biological networks

Published: 01 Jan 2024, Last Modified: 05 Feb 2025, License: CC BY-SA 4.0
Abstract: Recent technological advances in high-throughput sequencing have yielded vast amounts of large-scale biological omics data describing different aspects of cellular functioning. These omics data are typically modeled and analyzed as networks. Due to the high dimensionality of biological networks, embeddings are a cornerstone of their analysis. Embedding biological networks is challenging, as it involves capturing both the topological (similar wiring patterns) and the neighborhood-based similarity of the nodes. However, current network embedding algorithms do not preserve both types of similarity, which limits the information preserved in the embedding space. Moreover, existing methods for analyzing the embedding space of molecular networks feed the vectors of the biological entities into computationally intensive machine learning (ML) models for downstream analysis tasks. In contrast, in natural language processing (NLP), the word embedding space is mined directly through simple linear operations between word vectors. In this thesis, following the NLP paradigm, we mine biological knowledge directly from the embedding space and identify the properties that make a space suitable for linear operations.

In network biology, Non-negative Matrix Tri-Factorization (NMTF) is extensively used to embed networks in a low-dimensional space, because it is an explainable AI method that also enables the joint representation of different networks in a shared space. We demonstrate the power of NMTF-based data integration in the context of COVID-19 by applying two integration frameworks to identify COVID-19-related genes and to prioritize drugs targeting their gene products for repurposing. Our newly identified genes could not have been found with network-medicine or differential-expression-based approaches, which rely on a single type of omics data.

Then, to extract new biological knowledge through linear operations, we introduce two NLP-inspired network representations: the Positive Pointwise Mutual Information (PPMI) matrix and the Graphlet Degree Vector (GDV) PPMI matrix. The PPMI matrix captures the neighborhood-based similarities of the nodes through random walks between adjacent nodes, while the GDV PPMI matrix captures the topological ones through random walks between similarly wired nodes, regardless of adjacency. As a showcase, we represent the nodes of the human protein-protein interaction (PPI) network with our PPMI and GDV PPMI matrices and generate the embedding spaces by factorizing these matrices with NMTF. We show that genes embedded close together in these spaces have similar biological functions, so new biomedical knowledge can be extracted directly through linear operations on their embedding vectors. We exploit this property to predict genes participating in protein complexes and to identify cancer-related genes based on the cosine similarities between the vector representations of the genes.

We also go beyond embeddings that preserve a single type of similarity by introducing novel random-walk-based network embeddings that incorporate graphlets (small, connected, induced subgraphs) into the DeepWalk and LINE methods. We use graphlets, which capture both topological and neighborhood-based similarity, as the context for the random walks: in a graphlet-based random walk, a node can visit any other node that simultaneously participates in the given graphlet.
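To make the graphlet-based walk concrete, below is a minimal sketch, assuming the triangle (the 3-node graphlet) as the walk context and a simple window-based PPMI computation. The function names, parameters, and the triangle-only restriction are illustrative assumptions, not the thesis's implementation, which also covers larger graphlets.

```python
# Minimal sketch (not the thesis code): a graphlet-based random walk where a
# step may move between any two nodes that co-occur in a triangle, followed
# by a PPMI matrix built from windowed walk co-occurrences.
import numpy as np

def triangle_context(adj):
    """Boolean matrix: True where nodes i and j co-occur in a triangle.

    `adj` is a symmetric boolean (or 0/1) adjacency matrix.
    """
    n = adj.shape[0]
    ctx = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            # i and j lie in a triangle iff they are adjacent and share a neighbor
            if adj[i, j] and np.any(adj[i] & adj[j]):
                ctx[i, j] = ctx[j, i] = True
    return ctx

def ppmi_from_walks(ctx, walk_len=20, n_walks=10, window=2, seed=0):
    """Run random walks over the context matrix and return the PPMI matrix."""
    rng = np.random.default_rng(seed)
    n = ctx.shape[0]
    counts = np.zeros((n, n))
    for start in range(n):
        for _ in range(n_walks):
            walk, node = [start], start
            for _ in range(walk_len - 1):
                nbrs = np.flatnonzero(ctx[node])
                if nbrs.size == 0:
                    break
                node = rng.choice(nbrs)
                walk.append(node)
            for k, u in enumerate(walk):  # windowed co-occurrence counts
                for v in walk[k + 1:k + 1 + window]:
                    counts[u, v] += 1
                    counts[v, u] += 1
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row @ row.T))
    return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)  # keep positive PMI
```

Passing the plain adjacency matrix instead of `triangle_context(adj)` recovers a standard walk between adjacent nodes, i.e., the neighborhood-preserving PPMI variant described above.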
We show that in the graphlet-based representations of the networks, more adjacent nodes share the same label (i.e., the nodes are grouped in a more homophilic way) than in standard random-walk representations. We then factorize these matrices with NMTF and show that the more homophilic the network representation, the more functionally organized the corresponding embedding space, and the better the performance on downstream analysis tasks. Our new graphlet-based methodologies embed networks in linearly organized spaces, alleviating the need for computationally expensive ML methods.
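For the factorization step, the following is a minimal NMTF sketch using standard multiplicative updates that minimize ||X − F S Gᵀ||²; the ranks, constraints, and stopping criteria used in the thesis are not reproduced here, so treat this as a generic baseline rather than the authors' implementation.

```python
# Generic NMTF sketch (standard multiplicative updates, assumed here; not the
# thesis's exact algorithm): factorize a non-negative matrix X ≈ F @ S @ G.T.
import numpy as np

def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Non-negative matrix tri-factorization by multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k1))
    S = rng.random((k1, k2))
    G = rng.random((m, k2))
    for _ in range(n_iter):
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G
```

For a symmetric input such as a PPMI matrix, the rows of F can serve as the node embedding vectors that are then mined with linear operations.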
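Finally, as an illustration of mining the embedding space with linear operations alone, the sketch below ranks genes by their mean cosine similarity to a seed set (e.g., known cancer-related genes). The seed set and the use of the NMTF factor F as the embedding matrix are hypothetical assumptions made for the example.

```python
# Illustrative only: mining the embedding space with linear operations.
# `emb` is assumed to hold one embedding vector per gene (e.g., the rows of
# the NMTF factor F above); the seed gene set is hypothetical.
import numpy as np

def rank_by_cosine(emb, seed_idx, top=10):
    """Rank non-seed genes by mean cosine similarity to a set of seed genes."""
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    scores = unit @ unit[seed_idx].mean(axis=0)  # mean cosine to the seeds
    order = np.argsort(-scores)
    seeds = set(seed_idx)
    return [int(i) for i in order if i not in seeds][:top]

# Toy usage: 100 genes embedded in 16 dimensions, 3 hypothetical seed genes.
emb = np.random.default_rng(1).random((100, 16))
candidates = rank_by_cosine(emb, seed_idx=[0, 3, 7])
```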