SMILES2vec: Predicting Chemical Properties from Text Representations

Garrett B. Goh, Nathan Hodas, Charles Siegel, Abhinav Vishnu

Feb 12, 2018 (modified: Feb 12, 2018), ICLR 2018 Workshop Submission
  • Abstract: Chemical databases store information in text representations, and the SMILES format is a universal standard used in many cheminformatics software packages. Each SMILES string encodes structural information that can be used to predict complex chemical properties. In this work, we develop SMILES2vec, a deep RNN that automatically learns features from SMILES strings to predict a broad range of chemical properties, including toxicity, activity, solubility and solvation energy. Furthermore, we trained an interpretability mask for SMILES2vec solubility prediction, which identifies the specific parts of a chemical that are consistent with ground-truth knowledge with an accuracy of 88%, demonstrating that neural networks can learn technically accurate chemical concepts.
  • TL;DR: SMILES2vec: An RNN that reads text representations of chemicals to predict chemical properties and learns about real chemistry in the process.
  • Keywords: Deep Neural Network, Recurrent Neural Network, Natural Language Processing, Cheminformatics, Chemistry
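
To make the idea of an RNN reading SMILES text concrete, the sketch below shows one way a SMILES string could be turned into a fixed-length integer sequence suitable for an embedding layer. It is an illustration only: the abstract does not specify the tokenization, so the character-level scheme, the padding id 0, and the helper names build_vocab and encode are assumptions, not the authors' method.

```python
# Minimal sketch: character-level encoding of SMILES strings.
# Assumptions (not from the paper): one token per character,
# id 0 reserved for padding, right-padding to a fixed length.

def build_vocab(smiles_list):
    """Map each distinct character to a positive integer id."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(smiles, vocab, max_len):
    """Convert one SMILES string to max_len integer ids, right-padded with 0."""
    ids = [vocab[ch] for ch in smiles[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Ethanol, benzene, acetic acid
smiles_examples = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(smiles_examples)
encoded = [encode(s, vocab, max_len=10) for s in smiles_examples]
```

Each encoded sequence could then be fed to an embedding layer followed by recurrent layers. Note that a purely character-level scheme splits two-letter element symbols such as Cl and Br into separate tokens, so a real pipeline may prefer a tokenizer that is aware of SMILES syntax.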