Is Nomenclature Beneficial to Language Models for Chemistry?

ACL ARR 2024 June Submission 4353 Authors

16 Jun 2024 (modified: 21 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Most existing research on language model applications for chemistry uses the Simplified Molecular Input Line Entry System (SMILES) nomenclature, which encodes molecular structure as a string, for both input and output. In contrast, machine learning approaches that use human-readable IUPAC (International Union of Pure and Applied Chemistry) nomenclature remain underexplored. IUPAC names are widely used in the chemical literature, which offers the opportunity to train large language models on a vast corpus rich in contextual expertise. Motivated by this, we compare the two nomenclatures across various language-molecule scenarios. We find that simply switching to IUPAC names in challenging downstream tasks such as molecular generation, captioning, and editing improves performance by up to 4 times. Additionally, catastrophic forgetting during fine-tuning is reduced by half when using IUPAC names instead of SMILES.
Paper Type: Short
Research Area: Language Modeling
Research Area Keywords: applications; fine-tuning
Contribution Types: NLP engineering experiment
Languages Studied: English, molecular representation
Submission Number: 4353