Abstract: As scRNA-seq expands dataset richness and advances cellular biology, transformer-based single-cell foundation models have emerged. Despite their efficacy, these models face two key challenges: (1) Biological Perspective: they overlook the unordered nature of gene expression and fail to incorporate the pathway priors essential for capturing functional interactions. (2) Computational Perspective: the high dimensionality of gene tokens leads to excessive tokenization and the lack of an efficient mechanism for handling long-context sequences. Motivated by these challenges, we propose scPaLM, a novel Single-cell Pre-trained Language Model via Genetic Pathway Learning, built on three key innovations: (1) a Permutation-Invariant Embedding that handles the large number of genes with a patching technique while preserving permutation invariance; (2) a Genetic Pathway Learning module designed to learn discrete representations, enabling collective gene behaviors to be modeled in a data-driven way; (3) Cell-level Information Aggregation that progressively aggregates cell representations into a designated token during training, with a tailored masking strategy and a token-level contrastive regularizer. Comprehensive evaluations across four biological benchmarks demonstrate scPaLM's superiority: it achieves an average $\mathbf{10.1\%}$ improvement in cell type annotation over scGPT across all datasets and a $\mathbf{5.15\%}$ increase in drug response prediction correlation over scFoundation.
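To make the first innovation concrete, below is a minimal sketch (not the authors' implementation) of how a permutation-invariant patch embedding can be realized: per-gene embeddings are pooled into a fixed number of patches with a symmetric sum, so shuffling the input gene order leaves the output unchanged. The class name `PatchGeneEmbedder`, the fixed modulo gene-to-patch assignment, and all dimensions are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class PatchGeneEmbedder(nn.Module):
    """Sketch of a permutation-invariant patch embedding (assumed design).

    Each gene token combines a gene-identity embedding with a projection of
    its expression value. Tokens are then pooled into patches with a
    symmetric (order-insensitive) sum, so the result does not depend on the
    order in which genes are presented.
    """

    def __init__(self, num_genes: int, num_patches: int, dim: int):
        super().__init__()
        self.gene_embed = nn.Embedding(num_genes, dim)   # gene identity
        self.expr_proj = nn.Linear(1, dim)               # expression value
        # Fixed (assumed) gene -> patch assignment; a real model may learn it.
        self.register_buffer("patch_of", torch.arange(num_genes) % num_patches)
        self.num_patches = num_patches

    def forward(self, gene_ids: torch.Tensor, expr: torch.Tensor) -> torch.Tensor:
        # gene_ids: (B, G) integer gene indices; expr: (B, G) expression values
        tok = self.gene_embed(gene_ids) + self.expr_proj(expr.unsqueeze(-1))
        patches = self.patch_of[gene_ids]                # (B, G) patch index per gene
        out = tok.new_zeros(gene_ids.size(0), self.num_patches, tok.size(-1))
        # Sum-pooling per patch: invariant to any permutation of the genes.
        out.scatter_add_(1, patches.unsqueeze(-1).expand_as(tok), tok)
        return out                                       # (B, num_patches, dim)
```

Because the only cross-gene operation is the scatter-add sum, reordering the columns of `gene_ids` and `expr` in lockstep yields an identical output, which is the property the abstract refers to as permutation invariance.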
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: pre-training, continual learning, contrastive learning, word embeddings
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 5590