sciLaMA: A Single-Cell Representation Learning Framework to Leverage Prior Knowledge from Large Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 | ICML 2025 poster | CC BY 4.0
Abstract: Single-cell RNA sequencing (scRNA-seq) enables high-resolution exploration of cellular diversity and gene regulation, yet analyzing such data remains challenging due to technical and methodological limitations. Existing task-specific deep generative models, such as the Variational Auto-Encoder (VAE) and its variants, struggle to incorporate external biological knowledge, while transformer-based foundational large Language Models (LLMs or large LaMs) are limited by computational cost and by their applicability to tabular gene expression data. Here, we introduce sciLaMA (single-cell interpretable Language Model Adapter), a novel representation learning framework that bridges these gaps by integrating static gene embeddings from multimodal LaMs with scRNA-seq tabular data through a paired-VAE architecture. Our approach generates context-aware representations for both cells and genes and outperforms state-of-the-art methods in key single-cell downstream tasks, including batch effect correction, cell clustering, and cell-state-specific gene marker and module identification. sciLaMA thus offers a computationally efficient, unified framework for comprehensive single-cell data analysis and biologically interpretable gene module discovery.
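As a rough illustration of what a paired-VAE over cells and genes could look like, the following is a minimal NumPy sketch of a single forward pass. It is not the sciLaMA implementation: the linear encoders, the inner-product decoder, the squared-error reconstruction term, all dimensions, and every name (`linear_encoder`, `gene_embeddings`, and so on) are assumptions made for illustration, and the data are synthetic. The random `gene_embeddings` matrix merely stands in for static gene embeddings taken from a pretrained LaM.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_encoder(x, w_mu, w_logvar, rng):
    """Toy linear VAE encoder (assumed form): returns a reparameterized
    sample z together with the mean and log-variance of q(z | x)."""
    mu = x @ w_mu
    logvar = x @ w_logvar
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    return z, mu, logvar

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over rows."""
    per_row = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    return float(np.mean(per_row))

# Hypothetical sizes, chosen only to keep the example small.
n_cells, n_genes, emb_dim, latent_dim = 6, 5, 8, 3

# Synthetic count matrix (cells x genes) and stand-in gene embeddings (genes x emb_dim).
expression = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)
gene_embeddings = rng.standard_normal((n_genes, emb_dim))
log_expr = np.log1p(expression)

# Randomly initialized weights; a real model would learn these by gradient descent.
w_cell_mu = 0.1 * rng.standard_normal((n_genes, latent_dim))
w_cell_lv = 0.1 * rng.standard_normal((n_genes, latent_dim))
w_gene_mu = 0.1 * rng.standard_normal((emb_dim, latent_dim))
w_gene_lv = 0.1 * rng.standard_normal((emb_dim, latent_dim))

# Cell branch encodes expression profiles; gene branch encodes the external embeddings.
z_cell, mu_c, lv_c = linear_encoder(log_expr, w_cell_mu, w_cell_lv, rng)
z_gene, mu_g, lv_g = linear_encoder(gene_embeddings, w_gene_mu, w_gene_lv, rng)

# Assumed decoder: cell and gene latents share one space, and their inner
# product reconstructs the (log-transformed) expression matrix.
reconstruction = z_cell @ z_gene.T

recon_loss = float(np.mean((reconstruction - log_expr) ** 2))
kl_cell = kl_to_standard_normal(mu_c, lv_c)
kl_gene = kl_to_standard_normal(mu_g, lv_g)
total_loss = recon_loss + kl_cell + kl_gene
```

In this sketch the rows of `z_cell` and `z_gene` live in the same latent space, which is one way a model could yield representations for both cells and genes at once; how sciLaMA actually couples the two branches is described in the paper and the linked code, not here.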
Lay Summary: Understanding how our cells work is key to figuring out how tissues grow, develop, and change during disease. Researchers now apply advanced assays to study individual cells, but the data they produce can be noisy and difficult to interpret. Our method, called sciLaMA, helps make sense of this complex data by combining it with existing biological knowledge learned by large language models (LLMs), such as those behind the AI chatbots many people use daily. By bringing this knowledge into the analysis, sciLaMA more accurately identifies different cell states, fills in missing data, and highlights key genes that influence how cells behave over time or during disease progression. This makes the analysis more efficient and reveals new insights that could guide the discovery of biomarkers and potential treatment targets.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/microsoft/sciLaMA
Primary Area: Applications->Health / Medicine
Keywords: Representation Learning, Large Language Model, Batch Effect Correction, Gene Module Discovery
Submission Number: 13101