WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

Published: 26 Sept 2024, Last Modified: 13 Nov 2024
Venue: NeurIPS 2024 Track Datasets and Benchmarks Poster
License: CC BY 4.0
Keywords: drug discovery, benchmarking, small molecule
Abstract: While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices. We posit that without a sound model evaluation framework, the AI community's efforts cannot reach their full potential, which slows both progress and the transfer of innovation into real-world drug discovery. Thus, in this paper, we seek to establish a new gold standard for small molecule drug discovery benchmarking, *WelQrate*. Specifically, our contributions are threefold: ***WelQrate*** **Dataset Collection** - we introduce a meticulously curated collection of 9 datasets spanning 5 therapeutic target classes. Our hierarchical curation pipelines, designed by drug discovery experts, go beyond the primary high-throughput screen by leveraging additional confirmatory and counter screens along with rigorous domain-driven preprocessing, such as Pan-Assay Interference Compounds (PAINS) filtering, to ensure high data quality; ***WelQrate*** **Evaluation Framework** - we propose a standardized model evaluation framework covering high-quality datasets, featurization, 3D conformation generation, evaluation metrics, and data splits, which provides reliable benchmarking for drug discovery experts conducting real-world virtual screening; **Benchmarking** - we evaluate model performance through various research questions using the *WelQrate* dataset collection, exploring the effects of different models, dataset quality, featurization methods, and data splitting strategies on the results. In summary, we recommend adopting our proposed *WelQrate* as the gold standard in small molecule drug discovery benchmarking. The *WelQrate* dataset collection, curation code, and experimental scripts are all publicly available at www.WelQrate.org.
Supplementary Material: pdf
Submission Number: 2251