Additive Cook’s Distance Guided Training Set Reduction for Generalizable Foundation Models of Interatomic Potentials
Keywords: MLIP, Foundation Models, Data Pruning, Data Subsampling, Cook's Distance, Influence Score
TL;DR: We introduce Stepwise Additive Cook's Distance, a method to curate a subset of training data that yields more accurate and computationally efficient ML foundation models of interatomic potentials.
Abstract: Foundation models for machine-learned interatomic potentials (MLIPs) build upon large training sets that are computationally expensive and often contain redundant information that impairs generalization. To address this, we derive Additive Cook's Distance (ACD), a novel influence measure quantifying the impact of adding a data point to the training set. We use this in our stepwise ACD algorithm, an iterative method that starts with a small data subset and greedily adds the most influential configurations from the remaining pool. We validate our approach on two distinct MLIP benchmarks. For a linear qSNAP potential on a beryllium dataset with high configurational diversity, stepwise ACD achieves full-dataset accuracy using only half the data. We then apply our method to the non-linear MACE model by first linearizing it to select a representative subset from the chemically diverse Materials Project (MPTrj) dataset. A final MACE model trained only on this curated subset shows superior generalization to unseen structures, outperforming a model trained on the full dataset. This work demonstrates that stepwise ACD is a powerful strategy to reduce computational cost while enhancing the generalizability of MLIP foundation models.
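The abstract only outlines the stepwise procedure (start from a small subset, score the remaining pool by additive influence, greedily add the top configuration). The sketch below illustrates one plausible reading for a linear (or linearized) model, using a rank-one Sherman-Morrison update to estimate how much adding a candidate point would shift the current fit; the function names, the score normalization, and the stopping criterion are illustrative assumptions, not the paper's exact ACD formula.

```python
import numpy as np

def additive_influence(X_S, y_S, X_pool, y_pool, ridge=1e-8):
    """Hypothetical additive-influence score for a linear model.

    For each candidate (x, y) in the pool, estimate the shift in the
    fitted values on the selected set S if that point were added,
    via a rank-one (Sherman-Morrison) update of (X_S^T X_S)^{-1}.
    """
    n, p = X_S.shape
    G_inv = np.linalg.inv(X_S.T @ X_S + ridge * np.eye(p))
    beta = G_inv @ X_S.T @ y_S
    resid = y_S - X_S @ beta
    s2 = (resid @ resid) / max(n - p, 1)          # residual variance estimate

    scores = np.empty(len(X_pool))
    for i, (x, y) in enumerate(zip(X_pool, y_pool)):
        Gx = G_inv @ x
        denom = 1.0 + x @ Gx                      # Sherman-Morrison denominator
        delta_beta = Gx * (y - x @ beta) / denom  # coefficient shift if (x, y) is added
        shift = X_S @ delta_beta                  # induced change in fitted values on S
        scores[i] = (shift @ shift) / (p * s2)    # Cook's-distance-style normalization
    return scores

def stepwise_acd(X, y, n_start=100, n_target=1000, seed=0):
    """Greedy loop: seed with a small random subset, then repeatedly add the
    pool configuration with the largest additive-influence score."""
    rng = np.random.default_rng(seed)
    selected = list(rng.choice(len(X), size=n_start, replace=False))
    pool = [i for i in range(len(X)) if i not in set(selected)]
    while len(selected) < n_target and pool:
        scores = additive_influence(X[selected], y[selected], X[pool], y[pool])
        selected.append(pool.pop(int(np.argmax(scores))))
    return selected
```

For a non-linear model such as MACE, the same loop would presumably operate on feature vectors obtained from a linearization of the network, as the abstract describes.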
Submission Track: Paper Track (Short Paper)
Submission Category: AI-Guided Design
Institution Location: Austin, Texas; Los Alamos, New Mexico; Baku, Azerbaijan; Düsseldorf, Germany
AI4Mat RLSF: Yes
Submission Number: 130