Keywords: Large Language Models, AI Virtual Cell
Abstract: High-throughput single-cell sequencing has enabled large-scale cellular profiling and spurred the development of single-cell foundation models. These models, typically pretrained on transcriptomic data, learn general-purpose cellular representations but remain limited in modality coverage, causal reasoning, and interpretability, thus falling short of the vision of an Artificial Intelligence Virtual Cell (AIVC). In parallel, large language models (LLMs) have demonstrated strong potential for unifying heterogeneous modalities, adapting to diverse tasks, and generating interpretable reasoning chains in natural language, making them promising candidates for realizing an AIVC. Recent progress in applying LLMs to tasks such as cell annotation and perturbation prediction highlights this potential, yet key challenges persist, including insufficient task coverage, narrow evaluation metrics, and limited robustness to input and prompting factors. To address these gaps, we introduce \textbf{CeLLM}, a comprehensive benchmarking framework for evaluating \textbf{LLM}s in the \textbf{CeLL}ular domain. CeLLM covers a broad spectrum of tasks spanning gene-, cell-, and omics-level analyses; systematically assesses 15 open-source, proprietary, and biology-specialized models; and incorporates diverse evaluation criteria under multiple task settings. As a cross-scale, reproducible, and dynamic benchmark, CeLLM provides a sustainable platform to track progress, foster methodological innovation, and accelerate the development of LLMs toward virtual cell modeling.
Primary Area: datasets and benchmarks
Submission Number: 6289