A Diffusion-Based Autoencoder for Learning Patient-Level Representations from Single-Cell Data

Rebecca Boiarsky; Johann Wenckstern; Nicholas J Haradhvala; Gad Getz; David Sontag

Published: 11 Jun 2025, Last Modified: 18 Jul 2025. Venue: GenBio 2025 Poster. License: CC BY 4.0

Keywords: representation learning, autoencoder, transformer, diffusion, scRNA-seq, single cell, transcriptomics, clinical prediction

Abstract: Single-cell RNA sequencing (scRNA-seq) offers insights into cellular heterogeneity and tissue composition, yet leveraging this data for patient-level clinical predictions remains challenging due to the set-structured nature of single-cell data, as well as the scarcity of labeled samples. To address these challenges, we introduce scSet, a diffusion-based autoencoder that learns patient-level representations from sets of single-cell transcriptomes. Our method uses a transformer-based encoder to process variably sized and unordered cell inputs, coupled with a conditional diffusion decoder for self-supervised learning on unlabeled data. By pre-training on large-scale unlabeled datasets, scSet generates robust patient representations that can be fine-tuned for downstream clinical prediction tasks. We demonstrate the effectiveness of scSet patient embeddings for clinical prediction across multiple real-world datasets, where they outperform existing patient representations, even with limited labeled data. This work represents an important step toward bridging the gap between single-cell resolution and patient-level insights. Code is available at [https://github.com/clinicalml/scset](https://github.com/clinicalml/scset).
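To illustrate what it means for an encoder to handle "variably sized and unordered cell inputs", here is a minimal sketch of attention pooling over a set of cell vectors. It is not the scSet architecture, which the abstract describes only as a transformer-based encoder. The function `attention_pool`, the `query` vector (a stand-in for a learned parameter), and the random toy data are all assumptions made for illustration.

```python
import math
import random


def attention_pool(cells, query):
    """Pool a non-empty, variable-size set of cell vectors into one vector.

    cells: list of equal-length lists of floats, one per cell
    query: list of floats with the same length as each cell
           (hypothetical stand-in for a learned seed vector)
    """
    dim = len(query)
    # Scaled dot-product score between the query and each cell.
    scores = [
        sum(q * x for q, x in zip(query, cell)) / math.sqrt(dim)
        for cell in cells
    ]
    # Numerically stable softmax over the cells in the set.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted average of the cells; output length depends only on dim.
    return [
        sum(w * cell[j] for w, cell in zip(weights, cells))
        for j in range(dim)
    ]


# Toy data (random numbers, not real expression values).
rng = random.Random(0)
dim = 4
query = [rng.gauss(0, 1) for _ in range(dim)]
patient_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]
patient_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(7)]

emb_a = attention_pool(patient_a, query)
emb_b = attention_pool(patient_b, query)

# Reordering the cells should leave the embedding unchanged
# (up to floating-point rounding).
shuffled = patient_a[:]
rng.shuffle(shuffled)
emb_a_shuffled = attention_pool(shuffled, query)
```

Both patients map to a vector of the same length even though one has 50 cells and the other 7, and shuffling the cells does not change the result beyond rounding error. A fixed-size, order-independent summary of this kind is what a downstream decoder or classifier can be conditioned on.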

Submission Number: 117
