Elucidating the Design Space of Generative Models for Single-Cell Perturbation Prediction

Sanjukta Bhattacharya; Christian Gensbigler; Shaamil Karim

Published: 28 May 2026, Last Modified: 28 May 2026. Venue: GenBio 2026 Poster. License: CC BY 4.0

Keywords: single cell gene expression, latent space, diffusion models, discrete state space

TL;DR: A new state-of-the-art architecture: a discrete-latent perturbation model that learns to generate single-cell data.

Abstract: We introduce $\texttt{ExpressionVAE}$, the first discrete-latent perturbation model for single-cell data: a vector-quantized variational autoencoder paired with a perturbation-conditioned discrete prior. On Replogle and Parse 1M, it achieves state-of-the-art results on every distributional and cell-eval state metric we evaluate, with order-of-magnitude gaps on Fréchet distance and $\mathrm{MMD}^2$ over the strongest continuous-latent baseline. We test two prior families (autoregressive and masked discrete diffusion) and find that they achieve effectively identical numbers, isolating the gain to the discrete latent. A controlled output-head ablation further reveals a single design axis governing decoder-head choice: the richness of the inference-time sampling distribution. Standard evaluation metrics partition into three groups whose rankings flip along this axis. Finally, on a held-out CRISPRi reversion benchmark of $1{,}732$ perturbations under inflammatory cytokine stress, the frozen encoder effectively matches the scGPT model (trained on a $10\times$ larger dataset) on biological selectivity.
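To make two of the terms in the abstract concrete, the following is a minimal NumPy sketch of (a) the nearest-codebook lookup that turns a continuous latent into a discrete token in a vector-quantized autoencoder, and (b) a squared maximum mean discrepancy ($\mathrm{MMD}^2$) estimate between two sets of cells. It is not the authors' implementation. The function names `quantize` and `mmd2_rbf`, the RBF kernel, the `bandwidth` value, the biased estimator, and the toy array shapes are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import numpy as np


def quantize(z, codebook):
    """Assign each latent in z (n, d) to its nearest row of codebook (k, d).

    Returns the integer token per latent and the quantized vectors.
    Hypothetical helper for illustration only.
    """
    # Pairwise squared Euclidean distances, shape (n, k), via broadcasting.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]


def mmd2_rbf(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel.

    x has shape (n, d), y has shape (m, d). The kernel choice and
    bandwidth are assumptions, not taken from the paper.
    """
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Toy codebook with two codes in a 2-D latent space.
    codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
    z = np.array([[0.1, -0.1], [0.9, 1.2]])
    tokens, z_q = quantize(z, codebook)
    print("tokens:", tokens)

    # Toy "real" and "generated" cells; the second set is shifted.
    real = rng.normal(size=(200, 5))
    generated = rng.normal(loc=0.5, size=(200, 5))
    print("MMD^2 (real vs. generated):", mmd2_rbf(real, generated))
```

Lower $\mathrm{MMD}^2$ means the generated cells are closer in distribution to the real ones under the chosen kernel, which is the sense in which a gap on this metric is read. The biased estimator used here is non-negative by construction; an unbiased variant would drop the diagonal kernel terms.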

Submission Number: 222
