Abstract: Many single-cell foundation models (scFMs) learn representations of cellular identity through masked modeling of gene expression, yet standard random masking treats genes as independent tokens, a poor match for the modular, co-regulated structure of gene regulatory networks. We show that this mismatch invites shortcut learning: models reconstruct masked genes from locally correlated partners rather than capturing the mechanistic, global features of cell state. This shortcut disproportionately harms underrepresented cell types, including the rare and transitional populations most relevant to disease biology and target identification, which lack the data redundancy needed to overcome correlation-driven gradients. We introduce CorrMask, a data-driven masking strategy that constructs a gene dependency graph from expression covariance and masks correlated gene groups jointly, forcing the model to rely on higher-order biological context. Evaluated on tissue-specific corpora, CorrMask substantially improves annotation of underrepresented cell populations, enhances gene-level generalization as measured by dosage sensitivity prediction, and matches standard baselines with up to $3\times$ less pre-training data. Our results demonstrate that pre-training objectives which respect the internal structure of the data yield representations that capture biological mechanism rather than statistical artifact.
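The core idea, building a gene dependency graph from expression covariance and masking correlated gene groups jointly, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names (`corrmask_groups`, `mask_by_group`), the correlation threshold, and the use of connected components as "groups" are all assumptions for demonstration.

```python
import numpy as np


def corrmask_groups(X, corr_threshold=0.6):
    """Group genes via a thresholded gene-gene correlation graph.

    X is a (cells, genes) expression matrix. Genes whose absolute
    pairwise correlation exceeds the threshold are connected; the
    returned groups are the connected components of that graph.
    (Threshold value is a hypothetical choice, not from the paper.)
    """
    corr = np.corrcoef(X, rowvar=False)          # gene-by-gene correlation
    adj = np.abs(corr) >= corr_threshold
    np.fill_diagonal(adj, False)

    n = adj.shape[0]
    visited = np.zeros(n, dtype=bool)
    groups = []
    for start in range(n):                        # DFS over components
        if visited[start]:
            continue
        stack, comp = [start], []
        visited[start] = True
        while stack:
            g = stack.pop()
            comp.append(g)
            for nb in np.nonzero(adj[g])[0]:
                if not visited[nb]:
                    visited[nb] = True
                    stack.append(nb)
        groups.append(sorted(comp))
    return groups


def mask_by_group(n_genes, groups, mask_fraction=0.15, rng=None):
    """Mask whole correlated groups at once, not independent genes.

    Groups are drawn in random order until roughly mask_fraction of
    genes are masked; masking a group jointly removes the locally
    correlated partners the model could otherwise copy from.
    """
    rng = rng or np.random.default_rng(0)
    mask = np.zeros(n_genes, dtype=bool)
    budget = int(mask_fraction * n_genes)
    for gi in rng.permutation(len(groups)):
        if mask.sum() >= budget:
            break
        mask[groups[gi]] = True
    return mask
```

On a toy matrix with two tightly correlated gene pairs and one independent gene, `corrmask_groups` recovers the pairs as separate groups, and `mask_by_group` then hides each recovered group as a unit, which is the property that blocks reconstruction from correlated neighbors.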
Submission Number: 52