- Abstract: Chemical databases store information in text representations, and the SMILES format is a universal standard used by many cheminformatics software packages. Each SMILES string encodes structural information that can be used to predict complex chemical properties. In this work, we develop SMILES2vec, a deep RNN that automatically learns features from SMILES strings to predict a broad range of chemical properties, including toxicity, activity, solubility and solvation energy. Furthermore, we train an interpretability mask for SMILES2vec solubility prediction, which identifies specific parts of a chemical that are consistent with ground-truth knowledge with an accuracy of 88%, demonstrating that neural networks can learn technically accurate chemical concepts.
- Keywords: Deep Neural Network, Recurrent Neural Network, Natural Language Processing, Cheminformatics, Chemistry
- TL;DR: SMILES2vec: An RNN that reads chemical text representations to predict chemical properties and learns about real chemistry in the process.
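
To make the idea of an RNN "reading" SMILES text concrete, the sketch below shows one common way a SMILES string might be turned into an integer sequence before it reaches an embedding or recurrent layer. It is an illustration under stated assumptions, not the SMILES2vec preprocessing itself: the character-level tokenization, the padding index `PAD`, the length `max_len=10`, and the helper names `build_vocab` and `encode` are all hypothetical choices that the abstract does not specify.

```python
from typing import Dict, List

PAD = 0  # index reserved for padding (an assumption of this sketch)


def build_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map each distinct character in the corpus to a positive integer index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(smiles: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Encode one SMILES string as a fixed-length list of indices.

    Strings longer than max_len are truncated; shorter ones are right-padded.
    """
    ids = [vocab[ch] for ch in smiles[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Ethanol, benzene, acetic acid
examples = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(examples)
encoded = [encode(s, vocab, max_len=10) for s in examples]

for smiles, ids in zip(examples, encoded):
    print(smiles, ids)
```

Character-level tokenization is the simplest option, but it splits two-letter atoms such as `Cl` and `Br` into separate symbols; a tokenizer that treats those as single tokens would be a reasonable alternative.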