PerturBERT: Learning Gene Co-Variation Embeddings from Perturbation Signatures

Published: 02 Mar 2026, Last Modified: 08 May 2026, Venue: MLGenX 2026 Tiny Paper Track, License: CC BY 4.0
Abstract: Current foundation models for transcriptomic data are typically trained in a self-supervised manner to predict masked gene expression values within a sample given the other genes, thereby learning gene co-variation patterns from observational data. However, many translational applications require understanding how gene expression changes in response to interventions. We introduce PerturBERT, an encoder-only transformer that learns perturbational co-variation patterns from gene perturbation responses; it is pre-trained with masked-gene modeling on ~1M perturbation signatures across 248 cell lines. PerturBERT tokenizes each signature as a set of (downstream gene, response) pairs and produces gene embeddings contextualized by the genes' responses to interventions. PerturBERT gene embeddings achieve state-of-the-art results on a gene embedding benchmark and on gene dependency prediction. To our knowledge, PerturBERT is the first transformer explicitly pre-trained on gene perturbation responses, providing representations complementary to those of models trained on observational gene expression profiles.
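Illustration (not from the paper): the sketch below shows one way a perturbation signature could be turned into (downstream gene, response) pairs and then masked for masked-gene modeling. The abstract does not say how responses are encoded or what exactly is masked, so everything here is an assumption made for illustration: the discretization of responses into bins, the clipping range, the choice to mask the response while keeping the gene identity, and the names `tokenize_signature`, `mask_responses`, `n_bins`, `clip`, and `mask_prob` are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token for a hidden response


def tokenize_signature(signature, n_bins=5, clip=2.0):
    """Turn {gene: response score} into a list of (gene, response_bin) pairs.

    Assumption: responses are clipped to [-clip, clip] and mapped onto
    equal-width integer bins 0..n_bins-1. The paper may encode responses
    differently.
    """
    tokens = []
    for gene, score in sorted(signature.items()):
        clipped = max(-clip, min(clip, score))
        bin_id = min(n_bins - 1, int((clipped + clip) / (2 * clip) * n_bins))
        tokens.append((gene, bin_id))
    return tokens


def mask_responses(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of responses and return (inputs, targets).

    Assumption: the gene identity stays visible and only its response is
    masked. Targets are the hidden bin for masked positions, None otherwise.
    """
    rng = rng or random.Random(0)
    inputs, targets = [], []
    for gene, bin_id in tokens:
        if rng.random() < mask_prob:
            inputs.append((gene, MASK))
            targets.append(bin_id)
        else:
            inputs.append((gene, bin_id))
            targets.append(None)
    return inputs, targets


if __name__ == "__main__":
    # Toy signature with made-up response scores for three genes.
    toy_signature = {"TP53": -3.0, "MYC": 0.0, "EGFR": 2.0}
    toy_tokens = tokenize_signature(toy_signature)
    model_inputs, model_targets = mask_responses(toy_tokens, mask_prob=0.5)
    print(toy_tokens)
    print(model_inputs, model_targets)
```

In such a setup, an encoder would receive the masked inputs and be trained to predict the targets at masked positions, so that each gene's embedding is shaped by how its response co-varies with the responses of the other genes in the same signature.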
Submission Number: 46