Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics

16 Feb 2026 (modified: 08 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Neural scaling laws -- power-law relationships between loss, model size, and dataset size -- have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we study a canonical regime with V = 512 highly variable genes and D = 200,000 cells, alongside an exploratory comparison regime with V = 1,024 genes and D = 10,000 cells. Across seven model sizes spanning nearly six orders of magnitude in parameter count (533 to 3.4 x 10^8 parameters), we fit the parametric scaling law L(P) = a P^(-alpha) + c to validation mean squared error (MSE). The canonical V = 512, D = 200k regime exhibits clear power-law scaling on validation loss, and held-out test evaluation follows the same qualitative trend. By contrast, the V = 1,024, D = 10k comparison does not provide a clean causal test of data scarcity because vocabulary size, dataset size, and training budget all differ simultaneously; we therefore treat it as exploratory rather than definitive. We additionally report matched-V follow-up analyses, including a fixed-V = 512 data-size sweep, held-out test-set scaling, and an empirical check of cross-gene residual heterogeneity. Under a homoscedastic Gaussian approximation, the asymptotic floor in the canonical regime corresponds to approximately 2.3 bits of irreducible uncertainty remaining per masked gene position, not to a universal biological constant. We discuss implications for the design of single-cell foundation models and outline the additional matched sweeps and likelihood-based objectives needed to turn this preliminary quantity into a rigorous transcriptomic entropy estimate.
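To make the two quantitative steps in the abstract concrete, the sketch below shows (i) fitting L(P) = a P^(-alpha) + c to (parameter count, validation MSE) pairs and (ii) converting the fitted floor c into bits per masked gene position under the homoscedastic Gaussian approximation. This is a minimal illustration, not the submission's code: the use of scipy.optimize.curve_fit, the initial guess p0, and all function names are our assumptions, since the abstract does not specify the fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit


def scaling_law(P, a, alpha, c):
    """Parametric scaling law from the abstract: L(P) = a * P^(-alpha) + c.

    P is the model parameter count; c is the asymptotic (irreducible)
    loss floor approached as P grows.
    """
    return a * P ** (-alpha) + c


def fit_scaling_law(param_counts, val_mse, p0=(1.0, 0.1, 0.1)):
    """Fit (a, alpha, c) to observed (parameter count, validation MSE) pairs.

    p0 is a hypothetical initial guess; the paper does not report one.
    Returns the fitted (a, alpha, c) tuple.
    """
    popt, _pcov = curve_fit(
        scaling_law,
        np.asarray(param_counts, dtype=float),
        np.asarray(val_mse, dtype=float),
        p0=p0,
    )
    return popt


def bits_per_masked_gene(c):
    """Differential entropy, in bits, of a Gaussian with variance c:
    h = 0.5 * log2(2 * pi * e * c).

    Reading the fitted MSE floor c as a homoscedastic Gaussian residual
    variance, this is the "irreducible uncertainty per masked gene
    position" quantity quoted in the abstract.
    """
    return 0.5 * np.log2(2.0 * np.pi * np.e * c)
```

Inverting the entropy relation, the quoted 2.3 bits would correspond to a fitted floor of c = 2^(2 x 2.3) / (2 pi e) ≈ 1.4 in the squared-error units of the model's expression targets; as the abstract stresses, this figure is a property of the Gaussian reading, not a universal biological constant.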
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lee_Zamparo1
Submission Number: 7546