Bigger is not always better: evaluating target-specific dataset design strategies for regioselectivity prediction on complex molecules
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: active learning, experiment design, organic chemistry, regioselectivity prediction
TL;DR: We develop a target-informed active learning strategy to reduce the experimental burden of modeling regioselectivity on complex molecules.
Abstract: There has been growing interest in using ML models to predict reaction yields and selectivity in synthetic chemistry. However, the difficulty and cost of generating experimental data have been a roadblock to creating practical models for these tasks. For this reason, rational dataset design strategies are emerging in the field, though they are typically limited to clustering approaches that sample the overall chemical space broadly. In many real-world contexts, such as synthetic route planning, the chemist is instead narrowly interested in accurate predictions on a specific, known target. We therefore propose a contrasting dataset design strategy that exploits knowledge of the target to create small models focused on local regions of chemical space. We design a series of acquisition functions that consider model uncertainty, several metrics of chemical similarity, and varying degrees of dataset diversity. We find that an active learning strategy that selects training molecules similar to uncertain regions of the target outperforms approaches that consider target similarity alone. Target-focused datasets significantly reduced data requirements; in fact, these smaller datasets gave accurate predictions on targets where larger, diversity-oriented or randomly selected datasets failed. Evaluation was performed on two literature datasets of C–H functionalization reactions, along with experimental validation on five complex targets. In the process, we developed a new regioselectivity prediction tool for a reaction that had not previously been modeled. To conclude, we discuss our ongoing work on a stopping criterion for the active learning loop, which would enable a full experimental implementation of this workflow.
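To make the idea of selecting training molecules "similar to uncertain regions of the target" concrete, the following is a minimal sketch, not the authors' implementation. It assumes each reactive site is described by a numeric feature vector, that the current model supplies one uncertainty value per target site, and that cosine similarity stands in for the chemical similarity metrics mentioned in the abstract. The names `acquisition_score`, `select_batch`, and the toy molecules are all hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def acquisition_score(target_sites, target_uncertainty, candidate_sites):
    """Hypothetical acquisition function: for each target site, take the
    best similarity to any candidate site and weight it by the model's
    uncertainty at that target site, then sum over target sites."""
    if not candidate_sites:
        return 0.0
    score = 0.0
    for site, unc in zip(target_sites, target_uncertainty):
        best = max(cosine(site, c) for c in candidate_sites)
        score += unc * best
    return score


def select_batch(target_sites, target_uncertainty, pool, k):
    """Return the names of the k highest-scoring candidates in the pool."""
    ranked = sorted(
        pool,
        key=lambda name: acquisition_score(
            target_sites, target_uncertainty, pool[name]
        ),
        reverse=True,
    )
    return ranked[:k]


# Toy data (assumed, for illustration only): the target has two sites,
# and the model is far less certain about the first one.
target_sites = [[1.0, 0.0], [0.0, 1.0]]
target_uncertainty = [0.9, 0.1]

pool = {
    "mol_A": [[1.0, 0.1]],                # resembles the uncertain site
    "mol_B": [[0.1, 1.0]],                # resembles the confident site
    "mol_C": [[-1.0, 0.0], [0.0, -1.0]],  # resembles neither site
}

chosen = select_batch(target_sites, target_uncertainty, pool, k=1)
print(chosen)
```

In this toy pool, `mol_A` and `mol_B` are equally similar to the target overall, so a similarity-only ranking could not separate them; weighting by uncertainty favors `mol_A`, the molecule that informs the site the model is least sure about. A full active learning loop would retrain the model after each batch and recompute the uncertainties before selecting again.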
Submission Number: 206