Keywords: Computational Biology, Large Language Models
TL;DR: We introduce RAGCell, a versatile single-cell analysis framework that achieves a double-win in both cost-effectiveness and high performance.
Abstract: Single-cell foundation models (scFMs) are transforming computational biology by enabling generalizable, task-agnostic representations for versatile single-cell analysis. Despite their progress in enabling rapid deployment on downstream tasks, off-the-shelf scFMs still suffer from two overlooked issues: (I) (\textit{Pretraining Cost.}) Pretrain-based scFMs require pretraining on a vast volume of cells, making them resource-intensive to deploy in applications. (II) (\textit{Heterogeneous Gap.}) Large Language Model (LLM)-based scFMs ignore the substantial heterogeneous gap between the LLM textual space and the raw cellular space, which limits their capability on downstream tasks. To this end, we introduce RAGCell, a versatile single-cell analysis framework that achieves a double-win in both \textbf{cost-effectiveness} and \textbf{high performance}. The success of RAGCell lies in two key aspects: 1. Leveraging LLMs to construct cell-level and feature-level knowledge databases, which serve as supervision signals for training the cell model and significantly reduce the training cost ($>$ pretrain-based scFMs). 2. Aligning cell representations with text embeddings from the bi-level knowledge databases, enabling knowledge transfer from the textual space to the cellular space and effectively mitigating the heterogeneous gap ($>$ LLM-based scFMs).
Through extensive experiments on six downstream single-cell analysis tasks, we demonstrate that RAGCell achieves outstanding performance compared to state-of-the-art scFMs while operating at less than one-tenth of the cost of pretrain-based scFMs.
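To make the second ingredient concrete, below is a minimal, hypothetical sketch of how cell representations could be aligned with text embeddings from LLM-built knowledge databases via an InfoNCE-style contrastive objective; this is an illustration of the general alignment idea under our own assumptions, not the authors' actual RAGCell implementation, and the function name `alignment_loss` and the toy tensors are invented for demonstration.

```python
# Hedged sketch (not the RAGCell implementation): contrastive alignment between
# cell-model embeddings and text embeddings retrieved from hypothetical
# cell-level / feature-level knowledge databases.
import torch
import torch.nn.functional as F


def alignment_loss(cell_emb: torch.Tensor,
                   text_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss for matched (cell, text) pairs.

    cell_emb, text_emb: (batch, dim); row i of each tensor is a matched pair.
    """
    cell = F.normalize(cell_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = cell @ text.t() / temperature                  # pairwise cosine similarities
    targets = torch.arange(cell.size(0), device=cell.device)
    # Symmetric cross-entropy: cell-to-text and text-to-cell directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage: random tensors stand in for cell-model outputs and
# LLM-derived knowledge-database embeddings.
cells = torch.randn(8, 256)   # cell representations from the cell model
texts = torch.randn(8, 256)   # text embeddings from the bi-level knowledge database
loss = alignment_loss(cells, texts)
```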
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 1530