From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: atomic property prediction, pre-training, 3D atomic pre-training, graph neural networks, multi-task learning, molecules, materials
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We pre-train a large model on multiple chemical datasets in a multi-task learning framework to generate transferable atomic representations that can be fine-tuned for SOTA results across various tasks.
Abstract: The role of machine learning in computing atomic properties is expanding rapidly for a wide range of applications from healthcare to climate change. One important ingredient that has enabled this development is the creation of large and diverse molecular datasets. Given the extreme computational cost of these datasets, an important question moving forward is: Can we limit the need for exhaustive large dataset creation by pre-training a foundation style model over multiple chemical domains to generate transferable atomic representations for downstream fine-tuning tasks? Generalization across the entire molecular space is challenging due to the range and complexity of atomic interactions that exist. In this paper, we present Joint Multi-domain Pre-training (JMP), a robust supervised pre-training strategy that utilizes data from multiple chemical domains, $\sim$120 million examples in total. We demonstrate state-of-the-art results across many targets of the rMD17, QM9, MatBench, QMOF, SPICE, and MD22 datasets. Finally, we conduct ablations to study the impact of different components of JMP on downstream performance.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8604