Machine Learning for PROTAC Engineering

23 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Keywords: Deep learning, Chemoinformatics, PROTAC, Drug design.
TL;DR: This work introduces an open-source machine learning toolkit that predicts the degradation effectiveness of PROTAC molecules with a 77.36% validation accuracy and is designed to be accessible and reproducible for drug development.
Abstract: PROTACs (proteolysis-targeting chimeras) are a promising therapeutic technology that harnesses the cell's built-in degradation processes to degrade specific proteins. Despite their potential, developing new PROTAC molecules is challenging and requires significant expertise, time, and cost. Meanwhile, machine learning has transformed various scientific fields, including drug development. In this work, we present a strategy for curating open-source PROTAC data and propose an open-source toolkit for predicting the degradation effectiveness, i.e., activity, of novel PROTAC molecules. We organized the curated data into 16 different datasets ready to be processed by machine learning models. The datasets incorporate important features such as $pDC_{50}$, $D_{max}$, E3 ligase type, the amino acid sequence of the protein of interest (POI), and experimental cell type. Our toolkit includes a configurable PyTorch dataset class tailored to PROTAC features, a customizable machine learning model that can process various PROTAC features, and a hyperparameter optimization mechanism powered by Optuna. To evaluate the system, three surrogate models were developed, each using a different PROTAC representation. On our automatically curated public datasets, the best models achieved a 71.4% validation accuracy and a 0.73 validation ROC-AUC. This performance is comparable to state-of-the-art models for protein degradation prediction, and our approach is additionally open-source, easily reproducible, and less computationally complex than existing approaches.
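To make the activity features concrete, the following minimal Python sketch converts a measured $DC_{50}$ into $pDC_{50}$ using the standard definition $pDC_{50} = -\log_{10}(DC_{50}$ in mol/L$)$ and derives a binary activity label from $pDC_{50}$ and $D_{max}$. The function names, the thresholds (7.0 and 0.8), and the convention that $D_{max}$ is a fraction in [0, 1] are illustrative assumptions, not details taken from the submission.

```python
import math


def pdc50_from_nanomolar(dc50_nm: float) -> float:
    """Convert a DC50 value given in nM to pDC50 = -log10(DC50 in mol/L)."""
    if dc50_nm <= 0:
        raise ValueError("DC50 must be a positive concentration")
    return -math.log10(dc50_nm * 1e-9)  # 1 nM = 1e-9 mol/L


def is_active(
    pdc50: float,
    dmax: float,
    pdc50_threshold: float = 7.0,  # hypothetical cutoff, not from the paper
    dmax_threshold: float = 0.8,   # hypothetical cutoff, Dmax as a fraction
) -> bool:
    """Label a PROTAC as active when both potency and maximal degradation
    meet the (assumed) thresholds."""
    return pdc50 >= pdc50_threshold and dmax >= dmax_threshold


if __name__ == "__main__":
    potency = pdc50_from_nanomolar(10.0)  # a DC50 of 10 nM
    print(f"pDC50 = {potency:.2f}, active = {is_active(potency, dmax=0.9)}")
```

A binary label of this kind is what allows accuracy and ROC-AUC to be reported for the prediction task; the actual labeling rule used by the authors may differ.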
Submission Number: 7300