Fine-Tuned Language Models Generate Stable Inorganic Materials as Text

Nate Gruver; Anuroop Sriram; Andrea Madotto; Andrew Gordon Wilson; C. Lawrence Zitnick; Zachary Ward Ulissi

Fine-Tuned Language Models Generate Stable Inorganic Materials as Text

Nate Gruver, Anuroop Sriram, Andrea Madotto, Andrew Gordon Wilson, C. Lawrence Zitnick, Zachary Ward Ulissi

Published: 16 Jan 2024, Last Modified: 05 Mar 2024ICLR 2024 posterEveryoneRevisionsBibTeX

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: generative model, large language model, stable materials, AI for science

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: We fine-tune LLMs on string representations of crystals and generate new stable materials.

Abstract: We propose fine-tuning large language models for generation of stable materials. While unorthodox, fine-tuning large language models on text-encoded atomistic data is simple to implement yet reliable, with around 90\% of sampled structures obeying physical constraints on atom positions and charges. Using energy above hull calculations from both learned ML potentials and gold-standard DFT calculations, we show that our strongest model (fine-tuned LLaMA-2 70B) can generate materials predicted to be metastable at about twice the rate (49\% vs 28\%) of CDVAE, a competing diffusion model. Because of text prompting's inherent flexibility, our models can simultaneously be used for unconditional generation of stable material, infilling of partial structures and text-conditional generation. Finally, we show that language models' ability to capture key symmetries of crystal structures improves with model scale, suggesting that the biases of pretrained LLMs are surprisingly well-suited for atomistic data.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Submission Number: 5580

Loading