TL;DR: A dataset and benchmark paper introducing an 81-million-compound DNA-encoded library of small molecules for hit identification in drug discovery.
Abstract: DNA-Encoded Libraries (DELs) represent a transformative technology in drug discovery, facilitating the high-throughput exploration of vast chemical spaces. Despite their potential, the scarcity of publicly available DEL datasets presents a bottleneck for the advancement of machine learning methodologies in this domain. To address this gap, we introduce KinDEL, one of the largest publicly accessible DEL datasets and the first one that includes binding poses from molecular docking experiments. Focused on two kinases, Mitogen-Activated Protein Kinase 14 (MAPK14) and Discoidin Domain Receptor Tyrosine Kinase 1 (DDR1), KinDEL includes 81 million compounds, offering a rich resource for computational exploration. Additionally, we provide comprehensive biophysical assay validation data, encompassing both on-DNA and off-DNA measurements, which we use to evaluate a suite of machine learning techniques, including novel structure-based probabilistic models. We hope that our benchmark, encompassing both 2D and 3D structures, will help advance the development of machine learning models for data-driven hit identification using DELs.
Lay Summary: DNA-Encoded Libraries (DELs) are extensive collections of chemical compounds, each tagged with a unique DNA barcode. These libraries allow scientists to quickly test millions of compounds to see whether they bind to specific targets involved in diseases. A significant obstacle in the field is the scarcity of publicly available DEL datasets. Without such datasets, researchers struggle to develop and compare machine learning techniques, which slows progress in identifying potential new treatments.
To tackle this issue, we introduce KinDEL, a dataset of 81 million compounds designed to support the development of machine learning models for DEL research. The compounds were tested against two kinase targets, and KinDEL also offers a new benchmark with biophysical data for selected compounds, both with and without DNA tags.
The release of the KinDEL dataset equips the scientific community with the necessary tools to develop advanced machine learning models for DEL analysis, ultimately accelerating the discovery of new drug candidates. This initiative represents an important step forward in making DEL datasets more accessible for drug discovery research.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/insitro/kindel
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: DEL, small molecule, dataset, benchmark, drug discovery
Submission Number: 13707