Effective Biological Representation Learning by Masking Gene Expression

Published: 03 Mar 2026, Last Modified: 26 Apr 2026, Venue: ICLR 2026 Workshop FM4Science Poster, License: CC BY 4.0
Keywords: Transcriptomics, RNAseq, single-cell, perturbations, genetics, transformers, masked autoencoders, SSL
TL;DR: A carefully curated RNA-seq pretraining dataset and biologically grounded transformer design outperform much larger transcriptomic foundation models.
Abstract: Transcriptomic foundation models frequently fail to outperform linear baselines despite being trained on massive RNA sequencing corpora exceeding tens of millions of cells. To investigate this, we present TxFM, a transformer masked autoencoder trained on DiverseRNA-1.4M, a novel dataset of 1.4 million bulk and single-cell samples we curated from public data. We demonstrate that data quality can outweigh scale: TxFM outperforms larger models pretrained on datasets up to 100 times the size of ours. Using previously published benchmarks, we compare TxFM against 16 existing methods and achieve state-of-the-art zero-shot perturbation representation across three held-out cellular contexts, along with strong performance on single-cell clustering and classification tasks. Ablations show that our architecture enables effective transfer learning by integrating a high masking ratio with a library-size-bounded Poisson objective and a rectified tanh activation to enforce output constraints and sparsity. We also show that TxFM’s learned gene-specific parameters recover known protein complexes and pathways without supervision. These results establish that curated pretraining and appropriate architectural priors can yield robust transcriptomic representations that generalize across biological contexts.
Submission Number: 36