Pre-training of Single-cell Language Models through Genetic Pathway Learning

Published: 17 Jun 2024, Last Modified: 17 Jun 2024 · AccMLBio Poster · CC BY 4.0
Keywords: scRNA-seq; foundation model
Abstract: State-of-the-art single-cell RNA sequencing (scRNA-seq) techniques have significantly increased the depth and richness of scRNA-seq datasets, enabling a more comprehensive understanding of cellular biology and driving advances across a spectrum of research domains. In this work, we propose a novel $\textbf{S}$ingle-$\textbf{c}$ell Pre-trained $\textbf{L}$anguage $\textbf{M}$odel via Genetic $\textbf{Pa}$thway Learning, named scPaLM, that effectively harnesses scRNA-seq data and supports a variety of downstream applications. scPaLM integrates several novel designs: ($1$) an embedding process that represents gene information with a reduced token count, improving computational efficiency; ($2$) a genetic pathway learning module that learns discrete representations, modeling the collective behavior of genes in a data-driven way; ($3$) a training methodology that progressively aggregates cell representations into a designated token during training, combined with a tailored masking strategy and a token-level contrastive regularizer. scPaLM outperforms baselines by clear margins on various downstream tasks, including cell type annotation, imputation, and cancer drug response prediction. Code will be made public.
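The abstract does not specify how the genetic pathway learning module learns its discrete representations. Below is a minimal, hedged sketch assuming a vector-quantization-style codebook, which is one common way to learn discrete codes over grouped gene tokens; the module name `PathwayCodebook`, its parameters, and all shapes are hypothetical and are not taken from the paper.

```python
# Hedged sketch only: illustrates discrete representation learning via a
# vector-quantized codebook, NOT the authors' actual scPaLM module.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PathwayCodebook(nn.Module):
    """Map continuous gene-token embeddings to discrete pathway codes (hypothetical)."""

    def __init__(self, num_codes: int = 256, dim: int = 128, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # learnable discrete codes
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment loss weight

    def forward(self, z: torch.Tensor):
        # z: (batch, tokens, dim) continuous embeddings of grouped genes
        flat = z.reshape(-1, z.size(-1))                  # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)   # distance to every code
        idx = dists.argmin(dim=-1)                        # nearest code per token
        z_q = self.codebook(idx).view_as(z)               # quantized embeddings

        # Standard VQ objective: codebook loss + commitment loss
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator so gradients flow back to the encoder
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(z.shape[:-1]), loss


if __name__ == "__main__":
    vq = PathwayCodebook(num_codes=256, dim=128)
    gene_tokens = torch.randn(4, 64, 128)  # toy batch: 4 cells, 64 gene tokens
    quantized, codes, vq_loss = vq(gene_tokens)
    print(quantized.shape, codes.shape, vq_loss.item())
```

In such a setup, each quantized token can be read as membership in a data-driven "pathway" code, which is one plausible reading of modeling collective gene behavior; the actual scPaLM design may differ.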
Submission Number: 5