- Keywords: Drug discovery, multilabel
- Abstract: DNA-Encoded Library (DEL) data, often comprising millions of data points, enables large deep learning models to make real contributions to the drug discovery process (e.g., hit-finding). The current state-of-the-art method for modeling DEL data, the GCNN multiclass model, requires domain experts to create mutually exclusive classification labels from the multiple selection readouts of DEL data, and mutual exclusivity is not always an appropriate assumption for formulating the problem. In this work, we designed a GCNN multilabel architecture that directly models each selection readout, eliminating this dependency on human expertise. We selected effective choices for key modeling components, such as the label reduction scheme, based on in silico evaluation. To assess the model's performance in real-world drug discovery settings, we further carried out prospective wet-lab testing, in which the multilabel model showed a consistent improvement in hit rate (the percentage of hits in a proposed molecule list) over the current state-of-the-art multiclass model.
- Track: Original Research Track