Keywords: Bioinformatics, Systems Biology, Statistics, Similarity and Distance Learning, Big Data
TL;DR: We compiled the largest unified collection of single-cell perturbation datasets currently available and applied high-dimensional statistics to quantify similarities between perturbations in the datasets.
Abstract: Recent biotechnological advances have led to a growing number of single-cell studies that reveal molecular and phenotypic responses to large numbers of perturbations. However, analysis across diverse datasets is typically hampered by differences in format, naming conventions, data filtering and normalization. To facilitate the development and benchmarking of computational methods in systems biology, we collect a set of 44 publicly available single-cell perturbation-response datasets with molecular readouts, including RNA, proteins and chromatin accessibility (Figure Panel A). We apply uniform pre-processing and quality control pipelines and harmonize feature annotations. The resulting resource enables efficient development and testing of computational analysis methods, and facilitates direct comparison and integration across datasets. In this resource, 32 RNA datasets were perturbed using CRISPR and 9 were perturbed with drugs (Figure Panel B). We also include three scATAC datasets, as well as three CITE-seq datasets for which protein and RNA counts can be downloaded separately. For each scRNA-seq dataset we supply count matrices in which each cell carries a perturbation annotation and quality control metrics, including gene counts and mitochondrial read percentage. Quality control plots for each dataset are also available on scperturb.org. Notably, more than 8000 CRISPR perturbations are shared across multiple datasets. We anticipate that this resource will be useful for developing machine learning models of perturbation responses across datasets, among other tasks.