SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation

ICLR 2026 Conference Submission 19203 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Single-cell Foundation Models, Scientific AI Benchmark, Knowledge-augmented Evaluation
Abstract: Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present **SC-ARENA**, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a *virtual cell* abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks — cell type annotation, captioning, generation, perturbation prediction, and scientific QA — that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-matching metrics, we introduce **knowledge-augmented evaluation**, which incorporates external ontologies, marker databases, and scientific literature to support biologically faithful and interpretable judgments. Experiments and analysis across both general-purpose and domain-specialized LLMs demonstrate that: (i) under the *Virtual Cell* unified evaluation paradigm, current models achieve uneven performance on biologically complex tasks, particularly those demanding mechanistic or causal understanding; and (ii) our knowledge-augmented evaluation framework ensures biological correctness, provides interpretable, evidence-grounded rationales, and achieves high discriminative capacity, overcoming the brittleness and opacity of conventional metrics. **SC-ARENA** thus provides a unified and interpretable framework for assessing LLMs in single-cell biology, pointing toward the development of biology-aligned, generalizable foundation models.
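
To make the idea of knowledge-augmented evaluation concrete, the following is a minimal sketch, not the authors' implementation, of the kind of check the abstract describes: a model's free-text cell type annotation is normalized against an ontology synonym map and scored for support by a marker-gene reference. The ontology fragment, marker dictionary, and function names here are illustrative placeholders, not SC-ARENA resources.

```python
# Hypothetical ontology fragment: canonical label -> accepted synonyms.
ONTOLOGY_SYNONYMS = {
    "CD8-positive, alpha-beta T cell": {"cd8 t cell", "cytotoxic t cell", "cd8+ t cell"},
    "B cell": {"b lymphocyte", "b-cell"},
}

# Hypothetical marker reference: canonical label -> marker genes.
MARKER_DB = {
    "CD8-positive, alpha-beta T cell": {"CD8A", "CD8B", "CD3E", "GZMK"},
    "B cell": {"CD19", "MS4A1", "CD79A"},
}


def normalize_label(prediction: str) -> str | None:
    """Map a free-text prediction to a canonical ontology label, if possible."""
    text = prediction.strip().lower()
    for canonical, synonyms in ONTOLOGY_SYNONYMS.items():
        if text == canonical.lower() or text in synonyms:
            return canonical
    return None


def marker_support(canonical: str, cited_genes: set[str]) -> float:
    """Fraction of genes cited in the model's rationale that are known markers."""
    markers = MARKER_DB.get(canonical, set())
    if not cited_genes:
        return 0.0
    return len(cited_genes & markers) / len(cited_genes)


if __name__ == "__main__":
    prediction = "cytotoxic T cell"
    rationale_genes = {"CD8A", "GZMK", "ACTB"}  # genes mentioned in the model's answer

    label = normalize_label(prediction)
    if label is None:
        print("Prediction not grounded in the ontology fragment.")
    else:
        support = marker_support(label, rationale_genes)
        print(f"Canonical label: {label}; marker support: {support:.2f}")
```

A check of this form yields an interpretable, evidence-grounded judgment (which canonical label was matched, which cited genes are supported) rather than a brittle exact-string score, which is the contrast the abstract draws with conventional metrics.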
Primary Area: datasets and benchmarks
Submission Number: 19203