Track: Main track (up to 8 pages)
Abstract: Deep neural networks (DNNs) have improved our ability to predict regulatory activity from DNA sequences, providing valuable insights into gene regulation. However, these models often fail to generalize to sequences underrepresented in their training data, limiting applications like variant effect prediction and de novo sequence design. This limitation reflects a bias toward natural variation across the genome, making DNNs vulnerable to covariate shifts, where test sequences diverge statistically from the training distribution. Here, we introduce PIONEER, a computational platform that simulates functional genomics experiments to systematically benchmark and optimize training data composition through iterative AI-experiment cycles. Using PIONEER, we compare sequence proposal strategies—including active learning and random baselines—evaluating their impact on model generalization across increasing levels of covariate shift. To ensure a fair comparison, we also assess each approach within a fixed experimental budget, accounting for DNA synthesis costs. PIONEER provides a scalable and extensible framework for optimizing training data composition to enhance model generalization, advancing applications in regulatory genomics, synthetic biology, and precision medicine.
Submission Number: 44