Graph-based Subset Selection for Efficient Training of Gene Perturbation Models

TMLR Paper9002 Authors

17 May 2026 (modified: 22 May 2026), Under review for TMLR, CC BY 4.0

Abstract: Genomic studies face a vast hypothesis space, while interventions such as gene perturbations remain costly and time-consuming. To accelerate such experiments, gene perturbation models predict the transcriptional outcome of interventions. Since constructing the training set is challenging, active learning is often employed in a “lab-in-the-loop” process. While this strategy makes training more targeted, it is substantially slower, as it fails to exploit the inherent parallelizability of Perturb-seq experiments. Here, we focus on graph neural network–based gene perturbation models and propose a subset selection method that, unlike active learning, selects the training perturbations in one shot. Our method chooses the interventions that maximize the propagation of the supervision signal to the model, thereby enhancing generalization. The selection criterion is defined over the input knowledge graph and is optimized with submodular maximization, which provides a near-optimality guarantee. Experimental results across multiple datasets show that, in addition to providing months of acceleration compared to active learning, the method improves the stability of perturbation choices while maintaining competitive predictive accuracy.

Submission Type: Regular submission (no more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=PpaY4915hT

Changes Since Last Submission: We revised the manuscript to address the Action Editor's comments as described in our original answer. First, we corrected the Perturb-seq problem formulation. We clarify that the model input is a sampled control-cell expression vector and the target is an observed cell under the corresponding perturbation, but these do not form true pre/post measurements of the same cell. Second, we removed the incorrect MSE-specific formulation of the training objective. The revised formulation uses a model-specific loss and defines the per-perturbation population loss abstractly as $\mathcal{L}_u(\theta^S)$. We also made explicit that the theoretical argument only uses the gradients induced by this objective at the graph-propagated representations, and does not rely on a particular output loss. Third, we explain that GraphReach uses only information that is already inherent in GEARS (the perturbation data and the gene ontology graph), while IterPert (Huang et al., 2024) requires external multimodal priors, such as additional Perturb-seq datasets, optical pooled screens, and literature embeddings. Moreover, IterPert's empirical gains mainly come from integrating these additional multimodal priors, as indicated by the prior-only model in Huang et al. Our goal is to evaluate selection methods that use only information available to the core model, in order to isolate the effect of the selection strategy itself. That is why we do not include IterPert in the study, but we do include TypiClust, which was reported as the strongest non-multimodal baseline in Huang et al., and ACS-FW, which was competitive. Fourth, we expanded the description of the cost-effective lazy forward strategy, explaining the priority queue of cached marginal gains, why submodularity makes cached gains valid upper bounds, and why the lazy implementation returns the same selections as exact greedy while avoiding redundant marginal-gain computations. We also revised the pseudocode accordingly.
We additionally cleaned the appendix, corrected minor notation and grammar issues, and clarified reporting details such as metric scaling and computational infrastructure.
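
To illustrate the lazy forward idea summarized above, here is a minimal sketch, not the manuscript's pseudocode. It assumes a simple set-coverage objective as a stand-in for the actual selection criterion, which is not specified on this page. The names `coverage_gain`, `lazy_greedy`, and the toy `neighbors` graph are hypothetical. Each heap entry caches a marginal gain together with the round in which it was computed. Because the objective is submodular, a cached gain can only overestimate the current gain, so a freshly re-evaluated entry that still sits on top of the queue is the exact greedy choice.

```python
import heapq


def coverage_gain(node, covered, neighbors):
    """Marginal gain of adding `node`: number of nodes it newly covers.

    Coverage is an assumed stand-in objective (monotone and submodular).
    """
    return len(neighbors[node] - covered)


def lazy_greedy(neighbors, budget):
    """Pick up to `budget` nodes by greedy coverage with lazy re-evaluation.

    Returns (selected nodes, covered set, number of gain evaluations).
    """
    covered, selected = set(), []
    evaluations = 0
    # Max-heap via negated gains; the third field is the round (number of
    # selections made) at which the cached gain was computed.
    heap = []
    for node in neighbors:
        heap.append((-coverage_gain(node, covered, neighbors), node, 0))
        evaluations += 1
    heapq.heapify(heap)

    while heap and len(selected) < budget:
        neg_gain, node, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Fresh gain on top of the queue: every other cached gain is an
            # upper bound on its true gain, so this node is the greedy argmax.
            if -neg_gain <= 0:
                break  # nothing left to gain
            selected.append(node)
            covered |= neighbors[node]
        else:
            # Stale gain: recompute it for the current set and re-queue.
            gain = coverage_gain(node, covered, neighbors)
            evaluations += 1
            heapq.heappush(heap, (-gain, node, len(selected)))
    return selected, covered, evaluations


# Toy graph (hypothetical): each node maps to the set of nodes it reaches.
neighbors = {
    "A": {"A", "B", "C", "D"},
    "B": {"B", "C"},
    "C": {"C", "D", "E"},
    "D": {"E", "F", "G"},
    "E": {"G"},
}
selected, covered, evaluations = lazy_greedy(neighbors, budget=2)
print(selected)  # ['A', 'D']
```

On this toy graph, exact greedy would evaluate 5 + 4 = 9 marginal gains for two selections, whereas the lazy variant skips the candidates whose cached gains are already below the best fresh gain.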

Assigned Action Editor: ~Romain_Lopez1

Submission Number: 9002
