Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: long paper (4–8 pages excluding references)
Keywords: Single-Cell Foundation Models, Single-Cell RNA Sequencing, Masked Language Modeling, Inductive Bias in Biology, Representation Learning, Self-Supervised Learning
TL;DR: Addressing random masking's limited ability to capture meaningful representations in single-cell foundation models, we present CorrMask, a correlation-based masking strategy that achieves better generalization on both gene-level and cell-level tasks.
Abstract: Many single-cell foundation models (scFMs) learn representations of cellular identity through masked modeling of gene expression, yet standard random masking treats genes as independent tokens, a poor match for the modular, co-regulated structure of gene regulatory networks. In this work, we show that this mismatch enables shortcut learning: a model may reconstruct masked genes from locally correlated partners rather than capturing global cellular state, yielding representations that underserve underrepresented cell populations.
We introduce CorrMask, a data-driven masking strategy that constructs a gene dependency graph from expression covariance and masks correlated gene groups jointly, forcing the model to rely on higher-order biological context. Evaluated on tissue-specific corpora, CorrMask yields representations that improve both cell type annotation, particularly for underrepresented populations, and gene-level generalization, while matching standard baselines with up to $3{\times}$ less pre-training data. Our results suggest that meaningful single-cell representations require pre-training objectives that respect the dependency structure of the transcriptome.
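The masking strategy summarized in the abstract can be sketched in a few lines. The function name `corrmask`, the absolute-correlation threshold, and the connected-component grouping below are illustrative assumptions, not the authors' exact algorithm:

```python
import numpy as np

def corrmask(X, mask_ratio=0.15, corr_threshold=0.6, seed=None):
    """Sketch of correlation-based joint masking (hypothetical
    reimplementation; the paper's actual procedure may differ).

    X: (cells, genes) expression matrix.
    Returns a boolean mask over genes whose True entries are masked jointly.
    """
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]

    # Gene dependency graph from expression covariance: an edge joins
    # two genes whose absolute Pearson correlation exceeds the threshold.
    C = np.corrcoef(X, rowvar=False)
    adj = (np.abs(C) >= corr_threshold) & ~np.eye(n_genes, dtype=bool)

    # Correlated gene groups = connected components (simple DFS).
    group = -np.ones(n_genes, dtype=int)
    n_groups = 0
    for start in range(n_genes):
        if group[start] >= 0:
            continue
        stack = [start]
        group[start] = n_groups
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u]):
                if group[v] < 0:
                    group[v] = n_groups
                    stack.append(v)
        n_groups += 1

    # Mask whole groups at random until the target budget is reached,
    # so correlated partners are never left visible as shortcuts.
    mask = np.zeros(n_genes, dtype=bool)
    budget = int(mask_ratio * n_genes)
    for gid in rng.permutation(n_groups):
        members = np.flatnonzero(group == gid)
        if mask.sum() + len(members) > budget and mask.sum() > 0:
            continue
        mask[members] = True
        if mask.sum() >= budget:
            break
    return mask
```

Because whole components are masked together, the model cannot reconstruct a masked gene from a highly correlated unmasked partner, which is the shortcut the abstract describes.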
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 60