Keywords: retrosynthesis, graph neural network, simulated reaction
TL;DR: We train retrosynthesis models with simulated reactions, which increases prediction accuracy, confidence, and diversity.
Abstract: Retrosynthesis analysis aims to design reaction pathways and intermediates for a target compound. Recent work automates this process with machine learning (ML) approaches, which greatly accelerates synthesis pathway design. As data-driven approaches, ML models learn synthetic pathways from existing reaction data. Although a target product can usually be synthesized through multiple pathways, each product in the training dataset typically has only one corresponding reactant set. Existing models are therefore trained by treating all other reaction pathways as negative labels, which potentially introduces a large number of false negatives. In this work, we generate virtually validated simulated reactions by enumerating local reaction templates on the known reaction center, and we investigate the effect of training retrosynthesis models with these simulated reactions. We find that training with simulated reactions substantially increases not only prediction accuracy but also prediction diversity and prediction confidence. Specifically, the round-trip accuracy of top-5 predictions increases from 86.2% to 90.8%, and the round-trip accuracy of predictions with output scores greater than 0.5 increases from 93.5% to 96.4%. Moreover, the fraction of predictions with output scores greater than 0.5 increases from 6.0% to 26.5%. We also show that models trained with simulated reactions tend to predict more diverse synthesis pathways, including reactions that are rarely seen in the training set.
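The confidence-related quantities reported above can be illustrated with a minimal Python sketch. It assumes that a prediction counts as round-trip correct when a forward reaction predictor recovers the target product from the predicted reactants; that check is represented here only by a precomputed flag. The `Prediction` class, the function names, and the toy values are hypothetical and are not taken from the paper's code or results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    score: float         # model output score, assumed to lie in [0, 1]
    round_trip_ok: bool  # assumed flag: a forward model recovers the target product


def confident_fraction(preds: List[Prediction], threshold: float = 0.5) -> float:
    """Fraction of predictions whose output score exceeds the threshold."""
    if not preds:
        return 0.0
    return sum(p.score > threshold for p in preds) / len(preds)


def round_trip_accuracy(preds: List[Prediction],
                        threshold: Optional[float] = None) -> float:
    """Round-trip accuracy, optionally restricted to scores above the threshold."""
    kept = [p for p in preds if threshold is None or p.score > threshold]
    if not kept:
        return 0.0
    return sum(p.round_trip_ok for p in kept) / len(kept)


# Toy values for illustration only (not results from the paper).
preds = [
    Prediction(0.9, True),
    Prediction(0.7, True),
    Prediction(0.6, False),
    Prediction(0.2, True),
    Prediction(0.1, False),
]

print(confident_fraction(preds))        # share of predictions with score > 0.5
print(round_trip_accuracy(preds))       # accuracy over all predictions
print(round_trip_accuracy(preds, 0.5))  # accuracy over confident predictions only
```

Reporting the fraction of confident predictions alongside their accuracy matters because an increase in one could otherwise hide a drop in the other.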