scRAG: an Efficient Retrieval Augmented Generation System for scRNA-seq Data Analysis

Published: 2025, Last Modified: 21 Jan 2026, Venue: ICDE 2025, License: CC BY-SA 4.0
Abstract: An average person has more than 10 trillion cells, and each cell's transcriptome can be profiled by single-cell RNA sequencing (scRNA-seq). The huge volume and high complexity of scRNA-seq data pose challenges for two fundamental tasks of scRNA-seq data analysis: cell type identification and new cell type discovery. In this paper, we demonstrate scRAG, which efficiently removes batch effects in cell type identification and enables reliable discovery of new cell types, facilitated by GPU-based scRNA-seq data management and Large Language Models (LLMs). The GPU-based scRNA-seq data management enables high-throughput retrieval and update of scRNA-seq data, while the LLM uses the retrieval results to remove batch effects and discover novel cells. We demonstrate scRAG's (a) interfaces, (b) GPU-based scRNA-seq data management, and (c) applications in batch effect removal and cancer cell discovery.
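
The retrieve-then-annotate idea described in the abstract can be illustrated with a minimal sketch. This is not the scRAG implementation: the function names, the cosine similarity measure, the majority vote, and the `novelty_threshold` value are all assumptions made for illustration. A plain CPU nearest-neighbor search stands in for the GPU-based data management, and the LLM step that consumes the retrieved cells is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, reference, k=3):
    """Return the k reference cells most similar to the query profile."""
    scored = [(cosine(query, expr), label) for expr, label in reference]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]


def annotate(query, reference, k=3, novelty_threshold=0.8):
    """Label a query cell from its retrieved neighbors (hypothetical rule).

    If even the best match is weak, flag the cell as a candidate novel
    type instead of forcing a known label. The threshold is an assumed value.
    """
    hits = retrieve(query, reference, k)
    if not hits or hits[0][0] < novelty_threshold:
        return "candidate-novel", hits
    label = Counter(lbl for _, lbl in hits).most_common(1)[0][0]
    return label, hits


# Toy reference atlas: (expression vector over 3 genes, annotated cell type).
reference = [
    ([9.0, 1.0, 0.0], "T cell"),
    ([8.0, 2.0, 0.0], "T cell"),
    ([0.0, 1.0, 9.0], "B cell"),
    ([1.0, 0.0, 8.0], "B cell"),
]

known_label, _ = annotate([8.5, 1.5, 0.2], reference)
novel_label, _ = annotate([0.0, 9.0, 0.0], reference)
print(known_label, novel_label)
```

In a retrieval-augmented setup of the kind the abstract describes, the retrieved neighbors and their labels would be passed to the LLM as context rather than resolved by a simple vote as done here.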