Advancing the ColabFit Exchange towards a Web-scale Data Source for Machine Learning Interatomic Potentials

Published: 08 Oct 2024, Last Modified: 03 Nov 2024
Venue: AI4Mat-NeurIPS-2024 Spotlight
License: CC BY 4.0
Submission Track: Tools
Submission Category: AI-Guided Design
Keywords: MLIP, data, database, interatomic potential
TL;DR: Updates to the ColabFit Exchange, a database for interatomic potential training data, including data statistics, modifications to the data standard and database backend, and new tools to use the database for machine learning applications
Abstract: Data-driven (DD) interatomic potentials (IPs) trained on large collections of first-principles calculations are rapidly becoming essential tools in computational materials science and chemistry, both for discovery pipelines and for performing atomic-scale simulations. Despite this, apart from a few notable exceptions, there is a distinct lack of well-organized, public datasets in common formats available for IP development. This deficiency prevents the research community from implementing widespread benchmarking, which is essential for gaining insight into model performance and transferability, and also limits the development of more general, universal (perhaps even multi-source) IPs. To address this issue, last year we introduced the ColabFit Exchange, the first database providing open access to a large collection of systematically organized datasets from multiple domains that is specifically designed for IP development. It has now grown to contain 369 datasets spanning nearly 400,000 unique chemistries. Here we discuss recent updates to the ColabFit Exchange, including data statistics for the ever-growing database, modifications to the data standard and database backend, and new tools for using the data in machine learning (ML) applications.
Submission Number: 72