Position: Saturation in Single-Cell Foundation Model Benchmarks Signals Identifiability Failure, Not Solved Capability
Keywords: single-cell foundation models, benchmark saturation, identifiability, item response theory, position paper
TL;DR: Saturation of scFM benchmarks collapses Fisher information exponentially: the probability that the top-ranked scFM is truly the best drops from 0.75 to 0.41 as mean pass-rate grows from 0.50 to 0.95 on a 9-scFM, 200-cell panel. We propose a four-item IRT-based diagnostic standard.
Abstract: Single-cell foundation models (scFMs) including scGPT, Geneformer, and scFoundation now report cell-type classification accuracies of 0.85 to 0.95 on standard benchmarks. The community routinely interprets clustering of top scFMs near these accuracies as evidence that the benchmark is solved, or that differences among top scFMs are negligible. We argue both interpretations are wrong. Saturation is, structurally, an identifiability-failure signal: as observed pass-rates compress against the upper bound, the Fisher information on each scFM's latent capability collapses exponentially, and the data become uninformative about pairwise capability contrasts. We support the position with a controlled simulation of nine scFMs on 200 benchmark cells, mirroring the panel evaluated by Elmarakeby et al. (2025). As mean pass-rate rises from 0.50 to 0.95, the probability that the empirically best scFM is also the truly best scFM drops from 0.75 to 0.41, and Fisher information at p_hat=0.95 is only 18% of its maximum at p_hat=0.5. We propose a four-item identifiability-aware scFM benchmarking standard: pass-rate distribution disclosure, posterior pairwise comparison, Fisher-information reporting, and a saturation-triggered redesign protocol.
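The information argument in the abstract can be illustrated with a minimal sketch. It assumes a Rasch (1PL) model, under which one binary item carries Fisher information p(1-p) about latent capability; the abstract does not state which IRT parameterization the authors use. Under this simplification the ratio of information at p=0.95 to p=0.5 is 0.19, close to the reported 18%. The function names, the equally spaced capability grid (`spread`), and the trial count are hypothetical choices, not the authors' simulation, so the sketch is not expected to reproduce the 0.75 and 0.41 figures, only the direction of the effect.

```python
import numpy as np


def rasch_information(p):
    """Fisher information about latent capability from one binary item,
    assuming a Rasch (1PL) model: I = p * (1 - p)."""
    p = np.asarray(p, dtype=float)
    return p * (1.0 - p)


def prob_top_is_true_best(target_rate, n_models=9, n_items=200,
                          spread=0.4, n_trials=2000, seed=0):
    """Monte Carlo estimate of how often the empirically best model is
    also the truly best one. All defaults are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    # Centre the capability grid so the middle model passes at target_rate;
    # the panel's mean pass-rate is then approximately target_rate.
    offset = np.log(target_rate / (1.0 - target_rate))
    theta = offset + np.linspace(-spread, spread, n_models)
    p = 1.0 / (1.0 + np.exp(-theta))
    # Each model's pass count on n_items independent items, per trial.
    passes = rng.binomial(n_items, p, size=(n_trials, n_models))
    # The truly best model has the largest theta (last index). argmax
    # returns the first maximum, so ties count against the true best.
    return float(np.mean(passes.argmax(axis=1) == n_models - 1))


if __name__ == "__main__":
    ratio = rasch_information(0.95) / rasch_information(0.5)
    print(f"information ratio at p=0.95 vs p=0.5: {ratio:.2f}")
    for rate in (0.50, 0.95):
        print(rate, prob_top_is_true_best(rate))
```

The capability gaps are held fixed on the logit scale while only the benchmark's overall difficulty shifts, so any loss in ranking accuracy at a 0.95 pass-rate comes from saturation alone rather than from the models becoming more similar.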
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 132