Scaffold-Aware Generative Augmentation and Reranking for Enhanced Virtual Screening

20 Sept 2025 (modified: 12 Feb 2026), ICLR 2026 Conference Desk Rejected Submission, CC BY 4.0
Keywords: Virtual Screening, Data Augmentation, Diffusion Model, Graph Neural Networks
TL;DR: ScaffAug improves ligand-based virtual screening by addressing class and scaffold imbalance and boosting structural diversity through generative augmentation, self-training, and diversity-aware reranking.
Abstract: Ligand-based virtual screening (VS) is an essential step in drug discovery that evaluates large chemical libraries to identify compounds likely to bind a therapeutic target. However, VS faces three major challenges: class imbalance caused by the low active rate, structural imbalance among active molecules, where certain scaffolds dominate, and the need to identify structurally diverse active compounds for novel drug development. We introduce **ScaffAug**, a scaffold-aware VS framework that addresses these challenges through three modules. The *augmentation module* first generates synthetic data conditioned on the scaffolds of actual hits using a graph diffusion model. This mitigates the class imbalance and, through our proposed scaffold-aware sampling algorithm, which produces more samples for active molecules with underrepresented scaffolds, the structural imbalance as well. A model-agnostic *self-training module* then safely integrates the synthetic data generated by the augmentation module with the original labeled data. Lastly, a *reranking module* improves VS by increasing scaffold diversity in the top recommended set of molecules while maintaining, and even improving, overall performance in identifying novel active compounds. We conduct comprehensive computational experiments across five target classes, comparing ScaffAug against existing baseline methods on multiple evaluation metrics and performing ablation studies of ScaffAug. Overall, this work introduces novel perspectives on enhancing VS through generative augmentation, reranking, and scaffold-awareness.
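The abstract does not specify how the scaffold-aware sampling or the reranking is computed, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes that each active molecule already carries a precomputed scaffold identifier (a hypothetical string such as a scaffold SMILES), that the generation budget is split by inverse scaffold frequency, and that reranking is a simple greedy cap on molecules per scaffold. The function names `scaffold_sampling_budget` and `diversity_rerank` are illustrative and are not taken from the paper.

```python
from collections import Counter, defaultdict


def scaffold_sampling_budget(scaffolds, total_samples):
    """Split a generation budget across scaffolds by inverse frequency.

    scaffolds: one scaffold identifier per active molecule (hypothetical
        precomputed strings).
    total_samples: approximate number of synthetic molecules to generate.
    Returns a dict mapping scaffold -> number of samples to generate.
    """
    counts = Counter(scaffolds)
    # Assumption: weight each scaffold by 1 / count, so that rare
    # scaffolds receive a larger share of the budget.
    weights = {s: 1.0 / c for s, c in counts.items()}
    norm = sum(weights.values())
    # Rounding means the shares may not sum exactly to total_samples.
    return {s: round(total_samples * w / norm) for s, w in weights.items()}


def diversity_rerank(ranked, scaffold_of, max_per_scaffold):
    """Greedy scaffold-capped rerank (an assumed stand-in strategy).

    ranked: molecule ids sorted by predicted score, best first.
    scaffold_of: dict mapping molecule id -> scaffold identifier.
    max_per_scaffold: how many molecules of one scaffold may appear
        before the remaining ones are pushed to the end of the list.
    """
    seen = defaultdict(int)
    head, tail = [], []
    for mol in ranked:
        scaffold = scaffold_of[mol]
        if seen[scaffold] < max_per_scaffold:
            seen[scaffold] += 1
            head.append(mol)
        else:
            # Over the cap: keep the molecule, but after the diverse head.
            tail.append(mol)
    return head + tail


# Toy data: scaffold "A" dominates the actives, scaffold "B" is rare.
budget = scaffold_sampling_budget(["A", "A", "A", "B"], total_samples=40)
reranked = diversity_rerank(
    ["m1", "m2", "m3", "m4"],
    {"m1": "A", "m2": "A", "m3": "A", "m4": "B"},
    max_per_scaffold=1,
)
```

In this toy case the rare scaffold receives the larger share of the generation budget, and the only molecule with scaffold "B" moves up to second place in the reranked list. A real pipeline would derive scaffolds with a cheminformatics toolkit and would likely trade off predicted score against diversity rather than apply a hard cap.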
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 22970