What information is preserved in latent cell embeddings? A Benchmark for Single-Cell Reconstruction

Published: 04 Mar 2026, Last Modified: 11 Mar 2026
Venue: ICLR 2026 Workshop LMRL Poster
License: CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: long paper (4–8 pages excluding references)
Keywords: Single-cell RNA-seq, Representation Learning, Perturbation Modeling, Benchmarking
TL;DR: We formalize reconstruction as a key requirement for latent shift modeling and benchmark both classical and foundation-model single-cell embeddings using metrics spanning statistical fidelity and biological utility.
Abstract: Learning compressed representations of single-cell transcriptomics data has been instrumental in modeling biological and experimental shifts in cellular states. Most current methods embed cells into a low-dimensional representation, traverse the latent space in meaningful biological directions, and decode back to gene expression. Despite its importance, the choice of representation is typically treated as an implementation detail rather than a first-order modeling decision. However, if a latent space fails to capture biological information, the resulting cell embeddings may be biologically implausible, thus limiting modeling efficacy and downstream analysis. We therefore study a necessary requirement of representations for latent shift modeling: reconstruction, i.e., decoding latent representations back to gene expression profiles. Here, we present a systematic benchmark of reconstruction quality across widely used representation families (PCA, AEs, VAEs) and pretrained foundation-model embeddings augmented with trained decoders (scGPT, SCimilarity, STATE, scConcept). Across three datasets spanning perturbational and observational settings and different scales, we quantify reconstruction performance using both statistical fidelity and biological signal preservation, providing an empirical foundation for selecting representation schemes that retain connections to interpretable expression-based biological information.
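As a rough illustration of the embed, shift, decode loop and of what "reconstruction quality" can mean, the sketch below uses PCA, the simplest representation family named in the abstract. It is not the benchmark's code: the synthetic data, the helper names (`pca_fit`, `encode`, `decode`, `reconstruction_report`), and the two metrics (mean squared error and mean per-gene Pearson correlation) are assumptions chosen for illustration only.

```python
import numpy as np

def pca_fit(X, k):
    """Fit a rank-k PCA on a cells x genes matrix; return gene means and components."""
    mean = X.mean(axis=0)
    # SVD of the centered matrix; rows of Vt are principal axes in gene space.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def encode(X, mean, components):
    """Project expression profiles into the k-dimensional latent space."""
    return (X - mean) @ components.T

def decode(Z, mean, components):
    """Map latent coordinates back to gene expression space."""
    return Z @ components + mean

def reconstruction_report(X, X_hat):
    """Illustrative metrics (assumed, not the paper's): MSE and mean per-gene Pearson r."""
    mse = float(np.mean((X - X_hat) ** 2))
    Xc = X - X.mean(axis=0)
    Hc = X_hat - X_hat.mean(axis=0)
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (Hc ** 2).sum(axis=0))
    valid = denom > 0  # skip genes with zero variance in either matrix
    r = (Xc * Hc).sum(axis=0)[valid] / denom[valid]
    return {"mse": mse, "mean_gene_pearson": float(r.mean())}

# Synthetic stand-in for an expression matrix: low-rank signal plus small noise.
rng = np.random.default_rng(0)
n_cells, n_genes, true_rank = 200, 50, 5
signal = rng.normal(size=(n_cells, true_rank)) @ rng.normal(size=(true_rank, n_genes))
X = signal + 0.1 * rng.normal(size=(n_cells, n_genes))

mean, comps = pca_fit(X, k=5)
Z = encode(X, mean, comps)          # embed
X_hat = decode(Z, mean, comps)      # decode without any shift
report = reconstruction_report(X, X_hat)

# Latent shift: move every cell one unit along the first latent axis, then decode.
shift = np.zeros(5)
shift[0] = 1.0
X_shifted = decode(Z + shift, mean, comps)
```

Because the PCA decoder is linear, the decoded shift equals the first principal axis added to every reconstructed profile. With a nonlinear decoder (AE, VAE, or a decoder trained on foundation-model embeddings) that correspondence no longer holds in general, which is one reason reconstruction fidelity needs to be measured rather than assumed.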
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 48