A long range foundation model for zero-shot predictions in single-cell and spatial transcriptomics data

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Foundation models, single-cell RNA-seq, spatial transcriptomics, masked language modelling, computational biology, zero-shot inference
TL;DR: A foundation model for single-cell RNA-seq and spatial transcriptomics data that leverages gene expression sequences across multiple cells to learn to predict gene expression for unseen cells in a zero-shot setting.
Abstract: Large transformers pre-trained with language-model objectives have demonstrated success in multiple fields and hold tremendous potential for modeling single-cell RNA-seq and spatial transcriptomics data. However, these approaches have yet to overcome several challenges, including inductive biases that hinder generalization, artifacts and quality issues in the underlying data, and downstream evaluation pipelines that do not reflect the biological challenges in the field. In this work, we propose a new framework, sCellTransformer (sCT), that relies on a first-principles formulation of the problem as well as a validation pipeline designed to evaluate models' generalization through zero-shot predictions. sCT leverages a long-range convolutional-transformer architecture trained on unprocessed single-cell and spatial transcriptomics data. In contrast to previous works, sCT represents cells with up to 20,000 protein-coding genes, processes sets of multiple cells, and predicts about a million discretized gene expression tokens. We show that representing gene expression as discrete levels mitigates the high sparsity present in single-cell data during both training and evaluation. We present state-of-the-art empirical results on several zero-shot gene expression imputation, cell-typing, and clustering tasks in both the single-cell and spatial domains, outperforming current foundation models.
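To make the discretization and masked-prediction setup in the abstract concrete, below is a minimal sketch, not the authors' implementation: the bin count, quantile binning scheme, masking fraction, and all function names are illustrative assumptions. It shows how raw counts for a set of cells (~20k genes each, ~1M tokens per batch) could be mapped to discrete expression levels, with zeros kept as their own token so sparsity is handled explicitly, and then masked for a BERT-style training objective.

```python
import numpy as np

# Illustrative hyperparameters -- not specified in the abstract.
N_GENES = 19_968      # ~20k protein-coding genes per cell
N_BINS = 64           # number of discrete expression levels (assumption)
MASK_TOKEN = N_BINS   # extra token id reserved for masked positions

def discretize_expression(counts: np.ndarray, n_bins: int = N_BINS) -> np.ndarray:
    """Map raw counts of shape (cells, genes) to integer expression levels.

    Zero counts keep a dedicated token (0) so the model can treat the heavy
    sparsity of single-cell data explicitly; nonzero counts are
    log-transformed and binned by quantiles into levels 1..n_bins-1.
    """
    tokens = np.zeros(counts.shape, dtype=np.int64)
    nonzero = counts > 0
    if nonzero.any():
        logged = np.log1p(counts[nonzero])
        edges = np.quantile(logged, np.linspace(0.0, 1.0, n_bins))
        tokens[nonzero] = np.clip(np.digitize(logged, edges), 1, n_bins - 1)
    return tokens

def mask_tokens(tokens: np.ndarray, mask_frac: float = 0.15, seed: int = 0):
    """BERT-style masking: hide a fraction of gene tokens; the model is
    trained to recover the original discrete levels at masked positions."""
    rng = np.random.default_rng(seed)
    mask = rng.random(tokens.shape) < mask_frac
    inputs = np.where(mask, MASK_TOKEN, tokens)
    targets = np.where(mask, tokens, -100)  # -100 = ignored by the loss
    return inputs, targets

# Example: a set of 50 cells -> 50 * 19,968 = 998,400 tokens, i.e. roughly
# the "about a million discretized gene expression tokens" per forward pass.
counts = np.random.negative_binomial(1, 0.9, size=(50, N_GENES)).astype(float)
tok = discretize_expression(counts)
inp, tgt = mask_tokens(tok)
print(tok.size, inp.shape)  # 998400 (50, 19968)
```

Under this framing, zero-shot imputation amounts to masking the unobserved entries of a new cell and reading off the model's predicted discrete levels; no fine-tuning on the target dataset is required.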
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11205