Atoms as Words: A Novel Approach to Deciphering Material Properties using NLP-inspired Machine Learning on Crystallographic Information Files (CIFs)

19 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Condensed matter physics, Materials science, Material properties prediction, Crystallographic Information Files (CIFs), Natural Language Processing (NLP) in materials, Word2Vec-inspired technique, Atomic embeddings, CIFSemantics model, Band gap, Formation energy, Material representation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Inspired by NLP, our model, CIFSemantics, treats atoms and atomic positions in CIFs as words in text to predict material properties, tapping into the underutilized potential of CIFs for ML-based material representation.
Abstract: In condensed matter physics and materials science, predicting material properties requires understanding intricate many-body interactions. Conventional methods such as density functional theory (DFT) and molecular dynamics (MD) often resort to simplifying approximations and are computationally expensive. Meanwhile, recent machine learning methods use handcrafted descriptors for material representation, which sometimes neglect vital crystallographic information and are often limited to single-property prediction or a subclass of crystal structures. In this study, we pioneer an unsupervised strategy, drawing inspiration from Natural Language Processing (NLP), to harness the underutilized potential of Crystallographic Information Files (CIFs). We treat atoms and atomic positions within a CIF like words in a text. Using a Word2Vec-inspired technique, we produce atomic embeddings that capture intricate atomic relationships. Our model, CIFSemantics, trained on the extensive Materials Project dataset, predicts 15 distinct material properties from CIFs. Its performance rivals that of specialized models, marking a significant step forward in material property prediction.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2024
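
The abstract's core idea, treating atoms and atomic positions as words and learning embeddings with a Word2Vec-style objective, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the submission: the token scheme (one element token plus one rounded-position token per site), the names `site_to_tokens`, `cif_to_sentence`, and `skipgram_pairs`, the window size, and the toy NaCl-like site list. The sketch only builds the (center, context) training pairs that a skip-gram model would consume; it does not train embeddings or reproduce CIFSemantics.

```python
from typing import List, Tuple

Site = Tuple[str, float, float, float]

# Hypothetical atom-site records: (element symbol, fractional x, y, z).
# A few rock-salt-like sites are used purely as a toy example.
ATOM_SITES: List[Site] = [
    ("Na", 0.0, 0.0, 0.0),
    ("Na", 0.5, 0.5, 0.0),
    ("Cl", 0.5, 0.0, 0.0),
    ("Cl", 0.0, 0.5, 0.0),
]


def site_to_tokens(site: Site, decimals: int = 2) -> List[str]:
    """Map one atom site to two 'words': the element and its rounded position.

    This tokenisation is an assumed scheme, not the one used in the paper.
    """
    element, x, y, z = site
    position = "pos_" + "_".join(f"{c:.{decimals}f}" for c in (x, y, z))
    return [element, position]


def cif_to_sentence(sites: List[Site]) -> List[str]:
    """Concatenate the tokens of all sites into a single 'sentence'."""
    sentence: List[str] = []
    for site in sites:
        sentence.extend(site_to_tokens(site))
    return sentence


def skipgram_pairs(sentence: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (center, context) pairs within a symmetric window, as in skip-gram."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


sentence = cif_to_sentence(ATOM_SITES)
pairs = skipgram_pairs(sentence, window=2)
```

In a full pipeline, pairs like these would be fed to a skip-gram trainer so that elements and positions appearing in similar structural contexts end up with nearby vectors; how the submission actually orders tokens or defines context is not stated in the abstract.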