The Right Inductive Bias for the Job: Dependency-Aware Masking for Scientific Foundation Models

Published: 03 Mar 2026, Last Modified: 26 Apr 2026, Venue: ICLR 2026 Workshop FM4Science Poster, License: CC BY 4.0
Keywords: Single-Cell Foundation Models, Single-Cell RNA Sequencing, Masked Language Modeling, Inductive Bias in Biology, Representation Learning, Self-Supervised Learning
TL;DR: Using gene correlations as an inductive bias, we present CorrMask, a masking scheme for single-cell foundation models that addresses the limitations of random masking and improves performance on both gene- and cell-level downstream tasks.
Abstract: Foundation models for science are increasingly built by adapting self-supervised objectives from natural language processing, yet scientific data often violates the assumptions these objectives encode. We study this tension in single-cell transcriptomics, where masked language modeling treats genes as independent tokens despite the modular, co-regulated structure of gene regulatory networks. We show that this mismatch introduces a harmful inductive bias: models reconstruct masked genes from locally correlated partners rather than learning global cellular state, a form of shortcut learning that disproportionately harms underrepresented cell populations. To address this, we introduce CorrMask, a data-driven masking strategy that constructs a gene dependency graph from expression covariance and masks correlated gene groups jointly, encoding domain structure directly into the pre-training objective. Evaluating on tissue-specific corpora, CorrMask substantially improves annotation of rare cell populations, enhances gene-level generalization via dosage sensitivity prediction, and matches standard baselines with up to $3\times$ less pre-training data, all without architectural changes. Our results illustrate a broader principle for scientific foundation models: when the data exhibits structured feature dependencies, the masking strategy becomes a first-class inductive bias that must be designed with domain awareness.
Submission Number: 85
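
Illustrative sketch: The abstract describes CorrMask as building a gene dependency graph from expression covariance and masking correlated gene groups jointly. The Python below is a minimal sketch of that idea under stated assumptions, not the authors' implementation. It assumes the graph is obtained by thresholding absolute Pearson correlation and that gene groups are the connected components of that graph; the function names `correlated_gene_groups` and `group_mask` and the parameters `threshold` and `mask_ratio` are hypothetical and chosen only for illustration.

```python
import numpy as np


def correlated_gene_groups(expr, threshold=0.6):
    """Group genes by thresholded absolute Pearson correlation (an assumption).

    expr: array of shape (n_cells, n_genes).
    Returns a list of index arrays, one per connected component of the graph
    whose edges link genes with |correlation| >= threshold.
    """
    n_genes = expr.shape[1]
    # Genes with zero variance yield NaN correlations; treat them as uncorrelated.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.nan_to_num(np.corrcoef(expr, rowvar=False))
    adjacency = np.abs(corr) >= threshold
    np.fill_diagonal(adjacency, False)

    # Depth-first search to label connected components.
    labels = np.full(n_genes, -1)
    n_groups = 0
    for start in range(n_genes):
        if labels[start] != -1:
            continue
        labels[start] = n_groups
        stack = [start]
        while stack:
            gene = stack.pop()
            for neighbor in np.flatnonzero(adjacency[gene]):
                if labels[neighbor] == -1:
                    labels[neighbor] = n_groups
                    stack.append(neighbor)
        n_groups += 1
    return [np.flatnonzero(labels == k) for k in range(n_groups)]


def group_mask(groups, n_genes, mask_ratio=0.15, rng=None):
    """Mask whole gene groups, in random order, until the budget is reached.

    Because groups are masked as units, the final count may exceed the budget.
    """
    rng = np.random.default_rng() if rng is None else rng
    budget = max(1, int(round(mask_ratio * n_genes)))
    mask = np.zeros(n_genes, dtype=bool)
    for idx in rng.permutation(len(groups)):
        if mask.sum() >= budget:
            break
        mask[groups[idx]] = True
    return mask
```

In this sketch a masked gene never has a strongly correlated partner left visible, so the model cannot reconstruct it from a local neighbor alone, which is the shortcut the abstract attributes to random masking. How the paper actually defines groups, sets the masking budget, or handles very large components is not stated in the abstract and is left open here.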