Defining and benchmarking open problems in single-cell analysis

Malte D Luecken, Scott Gigante, Daniel Dimitrov, Yvan Saeys, Fabian J Theis, Smita Krishnaswamy

Published: 30 Jun 2025, Last Modified: 05 May 2026. Nature Biotechnology. CC BY 4.0

Abstract: Single-cell genomics has enabled the study of biological processes at an unprecedented scale and resolution. These studies were enabled by innovative data generation technologies coupled with emerging computational tools specialized for single-cell data. As single-cell technologies have become more prevalent, so has the development of new analysis tools, which has resulted in over 1,700 published algorithms [1] (as of February 2024). Thus, there is an increasing need to continually evaluate which algorithm performs best in which context to inform best practices [2,3] that evolve with the field. In many fields of quantitative science, public competitions and benchmarks address this need by evaluating state-of-the-art methods against known criteria, following the concept of a common task framework [4]. Here, we present Open Problems, a living, extensive, community-guided platform including 12 current single-cell tasks that we envisage will raise standards for the selection, evaluation and development of methods in single-cell analysis.

In single-cell genomics, as in many other domains, it is typical for analysis algorithms to be evaluated using benchmarks. However, such benchmarks are often of limited use because the field lacks standardized procedures for benchmarking [5], which leads different benchmarks to assess the same method differently and reach different conclusions. Bespoke benchmarks set up by method developers to evaluate newly developed algorithms often include datasets and metrics chosen to highlight the advantages of their tools, which has been shown to lead to less objective assessments [6,7]. Even if datasets and metrics are standardized, historical analysis shows that when benchmarks are implemented by the same groups that introduce new methods, the evaluations tend to inflate the performance of the newest models via custom hyperparameter selection and data processing [8].

To provide more uniform and neutral assessment, groups can perform specialized benchmarking studies independently of method development. Tools such as registered reports, which promote neutrality of benchmarking results by design, have recently gained popularity as a way to enable such studies. These efforts aim to systematically evaluate the current state of the art in a given area and may be less biased. However, their results are static and inevitably age. The frameworks behind such studies are typically not designed for extensibility or interoperability, which limits the value of reusing a framework to perform additional systematic benchmarks [5]. This inability to reuse infrastructure leads to repeated, non-standardized benchmarks that cannot provide the guidance that users need. For example, at least four benchmarks of batch integration methods exist [9,10,11,12], each of which uses a different set of datasets and metrics and suggests different optimal methods (Fig. 1a). Similar issues have been reported across other single-cell topics, where datasets and metrics typically have less than 10% overlap between benchmarks [13].