Keywords: Perturbation prediction, Perturb-seq benchmark analysis, Evaluation Methodology, Baseline Saturation, Regime-aware Evaluation, Simple Baselines, Genetic Perturbations
TL;DR: We introduce baseline saturation for perturbation-prediction benchmarks, showing that aggregate rankings mix baseline-solved regimes with resistant perturbations where deep models have headroom.
Abstract: Whether deep learning models for perturbation prediction outperform simple baselines remains contested. We argue this debate is ill-posed: standard benchmarks mix perturbations already well-predicted by simple rules with others whose reproducible signal remains unexplained. We introduce Baseline Saturation, a per-perturbation measure of how much of the control-bounded dynamic range a matched simple baseline already captures. Across nine CRISPR perturbation datasets and six generalization regimes, we find that within-regime heterogeneity dominates: even within a single scenario, perturbation-level saturation spans the full range, with approximately 40% of evaluations near-solved by baselines and 30% remaining resistant. Deep learning models (scGPT, GEARS, PRESAGE) outperform baselines specifically on resistant perturbations, recovering which genes respond and in which direction, but not response magnitudes. Because saturated perturbations dominate most test sets, these gains are masked by aggregate reporting. Biologically, resistance is driven by the uniqueness of a perturbation's response relative to the dataset mean and is cell-type-dependent rather than gene-intrinsic. Our results reframe evaluation: simple baselines define the solved regime, and their failures map the frontier where expressive models add value.
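Illustrative sketch (not the authors' formula): the abstract describes Baseline Saturation only in words, so the code below assumes one plausible reading. It takes the error of predicting "no change from control" as the full dynamic range and reports the fraction of that error a simple baseline removes, clipped to [0, 1]. The function name `baseline_saturation`, the squared-error choice, and the clipping are all assumptions made for illustration.

```python
import math


def baseline_saturation(observed, control, baseline, eps=1e-12):
    """Hypothetical per-perturbation saturation score in [0, 1].

    observed: measured post-perturbation expression, one value per gene
    control:  unperturbed (control) expression for the same genes
    baseline: a simple baseline's prediction for the same genes

    Assumed definition (not taken from the paper):
        1 - SSE(observed, baseline) / SSE(observed, control)
    clipped so that 0 means "no better than predicting control"
    and 1 means "baseline matches the observation exactly".
    """
    if not (len(observed) == len(control) == len(baseline)):
        raise ValueError("all inputs must have one value per gene")

    # Error of the trivial "nothing changes" prediction: the dynamic range.
    err_control = sum((o - c) ** 2 for o, c in zip(observed, control))
    # Error left over after applying the simple baseline.
    err_baseline = sum((o - b) ** 2 for o, b in zip(observed, baseline))

    if err_control < eps:
        # The perturbation has no measurable effect, so the ratio is undefined.
        return math.nan

    score = 1.0 - err_baseline / err_control
    return min(1.0, max(0.0, score))


# Toy example with four genes: the baseline gets the direction right
# but only half of the magnitude on the two responding genes.
control = [1.0, 1.0, 1.0, 1.0]
observed = [3.0, 1.0, 0.0, 1.0]
baseline = [2.0, 1.0, 0.5, 1.0]
print(baseline_saturation(observed, control, baseline))
```

Under this assumed definition, a score near 1 would correspond to a perturbation the baseline has effectively solved, and a score near 0 to a resistant perturbation where a more expressive model has room to improve. Any cutoffs separating the two regimes would have to come from the paper itself.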
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 220