Keywords: AI for Science, Drug Virtual Screening, Hard Negative Mining, Data Augmentation, Contrastive Learning
TL;DR: GenDrugCLIP uses SBDD models to generate pseudo positives and hard negatives, significantly improving virtual screening performance by addressing the scarce binding data and trivial negatives in drug-target CLIP methods.
Abstract: Virtual screening (VS) has become an indispensable component of early drug discovery, aiming to identify potential ligands for a given protein target. While CLIP-style methods (e.g., DrugCLIP) have emerged as a powerful solution by enabling efficient compound retrieval through drug-target representation alignment, current models face two fundamental challenges: (1) the scarcity of true binding data for training limits coverage of diverse binding modes, and (2) the use of trivial negatives (molecules that bind to other pockets) leads to a significant train-test domain gap. To address these challenges, we introduce GenDrugCLIP, a novel generation-augmented framework that repositions structure-based drug design (SBDD) models as controllable data engines. GenDrugCLIP implements a Generate-Filter-Score-Select pipeline to construct target-aware pseudo positives and hard negatives for triplet contrastive learning. Our approach not only expands the chemical space but also prevents the model from relying solely on trivial negatives. Extensive experiments on three benchmarks demonstrate that GenDrugCLIP achieves state-of-the-art performance, outperforming DrugCLIP by +7.66% in BEDROC and +7.45 in early enrichment on the DUD-E benchmark. Our work highlights the untapped potential of SBDD models as powerful data engines for representation learning, opening a new paradigm for data-efficient drug discovery.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 4973