Machine learning derived embeddings of bulk multi-omics data enable clinically significant representations in a pan-cancer cohort

Sanjay Nagaraj; Zachary R McCaw; Theofanis Karaletsos; Daphne Koller; Anna Shcherbina

Published: 27 Oct 2023, Last Modified: 12 Jul 2025, Venue: GenBio@NeurIPS2023 Poster

Keywords: omics, co-embedding, multi-ome, variational autoencoder

TL;DR: Generative ML models can create rich, clinically relevant co-embeddings of unpaired ATAC and RNA bulk sequencing data within a large pan-cancer patient cohort.

Abstract: Bulk multiomics data provides a comprehensive view of tissue biology, but datasets rarely contain matched transcriptomics and chromatin accessibility data for a given sample. Furthermore, it is difficult to identify relevant genetic signatures from the high-dimensional, sparse representations provided by omics modalities. Machine learning (ML) models can extract dense, information-rich, denoised representations from omics data, which facilitates the discovery of novel genetic signatures. To this end, we develop and compare generative ML models through an evaluation framework that examines the biological and clinical relevance of the latent embeddings they produce. We focus our analysis on pan-cancer multiomics data from a set of 21 diverse cancer metacohorts across three datasets. We additionally investigate whether our framework can generate robust representations from oncology imaging modalities (i.e., histopathology slides). Our best-performing models learn clinical and biological signals and outperform traditional baselines in our evaluations, including overall survival prediction.

Submission Number: 62
