Position: Saturation in Single-Cell Foundation Model Benchmarks Signals Identifiability Failure, Not Solved Capability
Keywords: single-cell foundation models, benchmark saturation, item response theory, identifiability, multi-omics, position paper
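TL;DR: Saturation of scFM benchmarks collapses the Fisher information about latent model capability exponentially. In simulation, the probability that the empirically top-ranked scFM is truly the best drops from 0.76 to 0.35 as the mean pass-rate rises from 0.50 to 0.95. We call for an IRT-based diagnostic standard.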
Abstract: Single-cell foundation models (scFMs), including scGPT (Cui et al. 2024), Geneformer (Theodoris et al. 2023), and scFoundation, now report cell-type classification accuracies of 85-95% or higher on standard benchmarks (Hou et al. 2026). The community routinely interprets the clustering of top scFMs near these accuracies as evidence that "the benchmark is solved" or that "differences between scFMs are negligible." We argue that both interpretations are wrong. Saturation is structurally a signal of identifiability failure: as observed pass-rates compress against the upper bound, the Fisher information about each scFM's latent capability collapses exponentially, and the data become uninformative about pairwise capability contrasts. We support the claim with a simulation of 13 scFMs on 200 benchmark cells: as the mean pass-rate rises from 0.50 to 0.95, the probability that the empirically best scFM is also the truly best scFM drops from 0.76 to 0.35, and the Fisher information at p_hat=0.95 (theta=3) is only 18% of its maximum at p_hat=0.5 (theta=0). Bayesian posterior probabilities Pr(theta_top > theta_second | y) degrade similarly. We propose a four-item identifiability-aware scFM benchmarking standard: pass-rate distribution disclosure, Fisher information reporting, posterior pairwise comparison, and a saturation-triggered redesign protocol.
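The abstract's argument can be illustrated with a minimal sketch. It assumes a Rasch (1PL) item response model, under which the per-item Fisher information is p(1 - p) with p = 1 / (1 + exp(-theta)). The abstract does not specify the simulation design, so everything here is an assumption made for illustration: the function names, the `spread` of true capabilities, the trial count, and the random tie-breaking. The sketch should reproduce the qualitative degradation, not necessarily the reported 0.76 and 0.35.

```python
import numpy as np


def rasch_prob(theta, difficulty=0.0):
    """Pass probability under a Rasch (1PL) model (an assumed model)."""
    return 1.0 / (1.0 + np.exp(-(theta - difficulty)))


def fisher_information(theta, difficulty=0.0):
    """Per-item Fisher information about theta under the Rasch model: p(1 - p)."""
    p = rasch_prob(theta, difficulty)
    return p * (1.0 - p)


def prob_top_is_best(mean_pass_rate, n_models=13, n_items=200,
                     spread=0.25, n_trials=2000, seed=0):
    """Estimate how often the empirically best model is also the truly best.

    Hypothetical design: true capabilities are drawn around the logit of
    `mean_pass_rate` with standard deviation `spread` (logit scale), and each
    model's score is a binomial count over `n_items` exchangeable items.
    """
    rng = np.random.default_rng(seed)
    center = np.log(mean_pass_rate / (1.0 - mean_pass_rate))
    hits = 0
    for _ in range(n_trials):
        theta = center + spread * rng.standard_normal(n_models)
        scores = rng.binomial(n_items, rasch_prob(theta))
        # Uniform noise in [0, 1) breaks ties at random without reordering
        # models whose integer scores differ.
        tie_broken = scores + rng.random(n_models)
        hits += int(np.argmax(tie_broken) == np.argmax(theta))
    return hits / n_trials


if __name__ == "__main__":
    # Ratio of Fisher information at theta=3 to its maximum at theta=0 (about 0.18).
    print(fisher_information(3.0) / fisher_information(0.0))
    for rate in (0.50, 0.80, 0.95):
        print(rate, prob_top_is_best(rate))
```

Holding the logit-scale spread fixed while raising the mean pass-rate is one way to isolate the effect of saturation: the true capability gaps are unchanged, but the observed scores carry less information about them.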
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 90