LeMat-Bulk: aggregating, and de-duplicating quantum chemistry materials databases

Published: 03 Mar 2025, Last Modified: 09 Apr 2025, Venue: AI4MAT-ICLR-2025 Poster, License: CC BY 4.0
Submission Track: Paper Track (Tiny Paper)
Submission Category: AI-Guided Design
Keywords: materials-discovery, dataset, material-fingerprint, crystals
Supplementary Material: pdf
TL;DR: We propose a unified, standardized dataset of over 5.3 million materials drawn from the Materials Project, OQMD, and Alexandria databases, along with a material fingerprint for identifying duplicates across them.
Abstract: The rapid expansion of materials science databases enables the training of predictive machine learning models that deliver fast, accurate estimates of materials properties, as well as generative models that explore the vast combinatorial space of material candidates. Initiatives like the Materials Project, OQMD, and Alexandria have greatly expanded the scope of computational materials science and fueled progress in the materials science community. However, they have also introduced challenges related to duplication, data integration, and interoperability, which complicate efforts to develop scalable machine learning models. To address these challenges, we introduce LeMat-Bulk, a unified dataset combining Density Functional Theory (DFT) calculations from the Materials Project, OQMD, and Alexandria. This dataset encompasses over 5.3 million materials across three DFT functionals, including the largest repository of PBESol and SCAN functional calculations ($\sim$500k). Our methodology standardizes DFT calculations across databases with varying parameters, resolving inconsistencies and enhancing cross-compatibility. In addition, we propose and benchmark a hashing function (BAWL), built on Ongari et al. (2022), that generates identifiers for crystalline inorganic materials by capturing their structural and compositional properties.
Submission Number: 64