A Diffusion-Based Autoencoder for Learning Patient-Level Representations from Single-Cell Data

Published: 11 Jun 2025, Last Modified: 18 Jul 2025GenBio 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: representation learning, autoencoder, transformer, diffusion, scRNA-seq, single cell, transcriptomics, clinical prediction
Abstract: Single-cell RNA sequencing (scRNA-seq) offers insights into cellular heterogeneity and tissue composition, yet leveraging this data for patient-level clinical predictions remains challenging due to the set-structured nature of single-cell data, as well as the scarcity of labeled samples. To address these challenges, we introduce scSet, a diffusion-based autoencoder that learns patient-level representations from sets of single-cell transcriptomes. Our method uses a transformer-based encoder to process variably sized and unordered cell inputs, coupled with a conditional diffusion decoder for self-supervised learning on unlabeled data. By pre-training on large-scale unlabeled datasets, scSet generates robust patient representations that can be fine-tuned for downstream clinical prediction tasks. We demonstrate the effectiveness of scSet patient embeddings for clinical prediction across multiple real-world datasets, where they outperform existing patient representations, even with limited labeled data. This work represents an important step toward bridging the gap between single-cell resolution and patient-level insights. Code is available at [https://github.com/clinicalml/scset](https://github.com/clinicalml/scset).
Submission Number: 117
Loading