Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: long paper (4–8 pages excluding references)
Keywords: Single-Cell Foundation Models, Single-Cell RNA Sequencing, Masked Language Modeling, Inductive Bias in Biology, Representation Learning, Self-Supervised Learning
TL;DR: Addressing random masking's limited ability to capture meaningful representations in single-cell foundation models, we present CorrMask, a correlation-based masking strategy that achieves better generalization on both gene-level and cell-level tasks.
Abstract: Many single-cell foundation models (scFMs) learn representations of cellular identity through masked modeling of gene expression, yet standard random masking treats genes as independent tokens, a poor match for the modular, co-regulated structure of gene regulatory networks. In this work, we show that this mismatch enables shortcut learning: a model may reconstruct masked genes from locally correlated partners rather than capturing global cellular state, yielding representations that underserve underrepresented cell populations.
We introduce CorrMask, a data-driven masking strategy that constructs a gene dependency graph from expression covariance and masks correlated gene groups jointly, forcing the model to rely on higher-order biological context. Evaluated on tissue-specific corpora, CorrMask yields representations that improve both cell type annotation, particularly for underrepresented populations, and gene-level generalization, while matching standard baselines with up to $3{\times}$ less pre-training data. Our results suggest that meaningful single-cell representations require pre-training objectives that respect the dependency structure of the transcriptome.
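The masking strategy summarized in the abstract can be sketched in a few lines. The function name `corrmask`, the absolute-correlation threshold, and the connected-component grouping below are illustrative assumptions, not the authors' exact algorithm:

```python
import numpy as np

def corrmask(X, mask_ratio=0.15, corr_threshold=0.6, seed=None):
    """Sketch of correlation-based joint masking (hypothetical
    reimplementation; the paper's actual procedure may differ).

    X: (cells, genes) expression matrix.
    Returns a boolean mask over genes whose True entries are masked jointly.
    """
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]

    # Gene dependency graph from expression covariance: an edge joins
    # two genes whose absolute Pearson correlation exceeds the threshold.
    C = np.corrcoef(X, rowvar=False)
    adj = (np.abs(C) >= corr_threshold) & ~np.eye(n_genes, dtype=bool)

    # Correlated gene groups = connected components (simple DFS).
    group = -np.ones(n_genes, dtype=int)
    n_groups = 0
    for start in range(n_genes):
        if group[start] >= 0:
            continue
        stack = [start]
        group[start] = n_groups
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u]):
                if group[v] < 0:
                    group[v] = n_groups
                    stack.append(v)
        n_groups += 1

    # Mask whole groups at random until the target budget is reached,
    # so correlated partners are never left visible as shortcuts.
    mask = np.zeros(n_genes, dtype=bool)
    budget = int(mask_ratio * n_genes)
    for gid in rng.permutation(n_groups):
        members = np.flatnonzero(group == gid)
        if mask.sum() + len(members) > budget and mask.sum() > 0:
            continue
        mask[members] = True
        if mask.sum() >= budget:
            break
    return mask
```

Because whole components are masked together, the model cannot reconstruct a masked gene from a highly correlated unmasked partner, which is the shortcut the abstract describes.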
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 60