Abstract: As scRNA-seq expands dataset richness and advances cellular biology, transformer-based single-cell foundation models have emerged. Despite their efficacy, these models face two key challenges: (1) Biological Perspective: they overlook the unordered nature of gene expression and fail to incorporate the pathway priors essential for capturing functional interactions. (2) Computational Perspective: the high dimensionality of gene tokens leads to excessive tokenization and the lack of an efficient mechanism for handling long-context sequences. Motivated by these challenges, we propose scPaLM, a novel Single-cell Pre-trained Language Model via Genetic Pathway Learning, built on three key innovations: (1) a Permutation-Invariant Embedding that handles the large number of genes with a patching technique while preserving permutation invariance; (2) a Genetic Pathway Learning module designed to learn discrete representations, enabling collective gene behaviors to be modeled in a data-driven way; (3) Cell-level Information Aggregation that progressively aggregates cell representations into a designated token during training, with a tailored masking strategy and a token-level contrastive regularizer. Comprehensive evaluations across four biological benchmarks demonstrate scPaLM's superiority: it achieves an average $\mathbf{10.1\%}$ improvement in cell type annotation over scGPT across all datasets and a $\mathbf{5.15\%}$ increase in drug response prediction correlation over scFoundation.
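To make the first innovation concrete, below is a minimal sketch (not the authors' implementation) of how a permutation-invariant patch embedding can be realized: per-gene embeddings are pooled into a fixed number of patches with a symmetric sum, so shuffling the input gene order leaves the output unchanged. The class name `PatchGeneEmbedder`, the fixed modulo gene-to-patch assignment, and all dimensions are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class PatchGeneEmbedder(nn.Module):
    """Sketch of a permutation-invariant patch embedding (assumed design).

    Each gene token combines a gene-identity embedding with a projection of
    its expression value. Tokens are then pooled into patches with a
    symmetric (order-insensitive) sum, so the result does not depend on the
    order in which genes are presented.
    """

    def __init__(self, num_genes: int, num_patches: int, dim: int):
        super().__init__()
        self.gene_embed = nn.Embedding(num_genes, dim)   # gene identity
        self.expr_proj = nn.Linear(1, dim)               # expression value
        # Fixed (assumed) gene -> patch assignment; a real model may learn it.
        self.register_buffer("patch_of", torch.arange(num_genes) % num_patches)
        self.num_patches = num_patches

    def forward(self, gene_ids: torch.Tensor, expr: torch.Tensor) -> torch.Tensor:
        # gene_ids: (B, G) integer gene indices; expr: (B, G) expression values
        tok = self.gene_embed(gene_ids) + self.expr_proj(expr.unsqueeze(-1))
        patches = self.patch_of[gene_ids]                # (B, G) patch index per gene
        out = tok.new_zeros(gene_ids.size(0), self.num_patches, tok.size(-1))
        # Sum-pooling per patch: invariant to any permutation of the genes.
        out.scatter_add_(1, patches.unsqueeze(-1).expand_as(tok), tok)
        return out                                       # (B, num_patches, dim)
```

Because the only cross-gene operation is the scatter-add sum, reordering the columns of `gene_ids` and `expr` in lockstep yields an identical output, which is the property the abstract refers to as permutation invariance.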
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: pre-training, continual learning, contrastive learning, word embeddings
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 5590