Genes Are Not Words: Dependency-Aware Masking for Single-Cell Foundation Models

Published: 02 Mar 2026, Last Modified: 13 Mar 2026 · Gen² 2026 Poster (Top 10) · CC BY 4.0
Track: Full / long paper (5-8 pages)
Keywords: Single-Cell Foundation Models, Single-Cell RNA Sequencing, Masked Language Modeling, Inductive Bias in Biology, Representation Learning, Self-Supervised Learning
TL;DR: We present CorrMask, a masking strategy that masks correlated gene groups jointly to prevent shortcut learning in single-cell foundation models, achieving performance comparable to standard masking with up to 3x less data.
Abstract: Many single-cell foundation models (scFMs) learn representations of cellular identity by applying masked language modeling to gene expression data, yet this direct transfer from NLP imports an implicit independence assumption that conflicts with the modular, co-regulated structure of gene regulatory networks. In this work, we show that this domain mismatch enables shortcut learning: models reconstruct masked genes from locally correlated partners rather than encoding global cellular state. This disproportionately harms underrepresented cell types, the rare and transitional populations most relevant to perturbation biology and target identification. We introduce CorrMask, a genomics-native masking strategy that constructs a gene dependency graph from expression covariance and masks correlated gene groups jointly, forcing the model to rely on higher-order biological context. Evaluated on tissue-specific corpora, CorrMask substantially improves annotation of underrepresented cell populations, enhances gene-level generalization via dosage sensitivity prediction, and matches standard baselines with up to $3\times$ less pre-training data, all without architectural changes or additional external data. Our results highlight gene co-regulation as a critical barrier to effective self-supervised learning in genomics, and demonstrate how scFMs can benefit from domain-aware masking.
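The two steps the abstract describes, building a dependency graph from expression covariance and masking connected groups of correlated genes jointly, lend themselves to a short sketch. The Python snippet below is a minimal, hypothetical illustration of that idea, not the paper's implementation: the function names, the correlation threshold, and the mask rate are all assumptions chosen for clarity.

```python
# Hypothetical sketch of CorrMask-style dependency-aware masking.
# Illustrative only; thresholds and names are not from the paper.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def build_gene_groups(expr, corr_threshold=0.6):
    """Group genes whose expression is strongly correlated.

    expr: (cells x genes) matrix, e.g. log-normalized counts.
    Returns a list of index arrays, one per connected component of
    the thresholded absolute-correlation graph.
    """
    corr = np.corrcoef(expr, rowvar=False)   # (genes x genes) correlations
    corr = np.nan_to_num(corr)               # constant genes yield NaNs
    np.fill_diagonal(corr, 0.0)              # ignore self-correlation
    adj = csr_matrix(np.abs(corr) >= corr_threshold)
    n_comp, labels = connected_components(adj, directed=False)
    return [np.where(labels == c)[0] for c in range(n_comp)]


def corr_mask(n_genes, groups, mask_rate=0.15, rng=None):
    """Sample a boolean mask over genes, masking whole correlated
    groups jointly until roughly `mask_rate` of genes are covered."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(n_genes, dtype=bool)
    budget = int(mask_rate * n_genes)
    for g in rng.permutation(len(groups)):
        mask[groups[g]] = True               # mask the group as a unit
        if mask.sum() >= budget:
            break
    return mask
```

Because each group is masked as a unit, a model cannot reconstruct a hidden gene from its strongly correlated partners (they are hidden too), which is the shortcut the abstract argues standard per-gene masking leaves open.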
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 39