Diversity-driven training of machine-learned force fields

Published: 05 Nov 2025, Last Modified: 05 Nov 2025
Venue: AI4Mat-NeurIPS-2025 Poster
License: CC BY 4.0
Keywords: Machine-learned force fields, dataset diversity, Vendi scores
TL;DR: We show the implications of dataset diversity for the efficient training of machine-learned force fields.
Abstract: We analyze the importance of training dataset diversity for machine-learned force fields (MLFFs), with the goal of accelerating their development for AI-driven materials design frameworks. We focus on a ceramic system (3C-SiC) relevant to thermal protection applications in hypersonic flight, and we use the MACE model to represent our MLFFs. Each MACE model is trained on a dataset sampled from ab initio molecular dynamics (AIMD) trajectories simulated at multiple temperatures. Using diversity-driven sampling to construct different training datasets, we investigate the role of training set diversity in building an accurate force field with reduced data requirements. The material’s structural environment is encoded using the many-body tensor representation (MBTR), and similarity between configurations is quantified via a radial basis function (RBF) kernel and the Vendi score. Our results reveal that greater diversity in the sampled datasets yields more accurate force predictions even with smaller datasets. These findings underscore the importance of systematically quantifying dataset diversity for efficient MLFF training and highlight a pathway toward scalable force field development for automated materials design workflows.
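The diversity measure named in the abstract can be illustrated with a minimal sketch. It assumes each configuration has already been encoded as a fixed-length descriptor vector (standing in for MBTR features, which are not computed here) and uses the standard definition of the Vendi score as the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n-by-n similarity matrix with unit diagonal. The function names `rbf_kernel` and `vendi_score` and the bandwidth parameter `gamma` are hypothetical choices for illustration, not the authors' implementation or settings.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """RBF similarity matrix K_ij = exp(-gamma * ||x_i - x_j||^2).

    X is an (n, d) array of descriptor vectors (assumed here to stand in
    for MBTR features); gamma is an assumed bandwidth parameter.
    """
    sq_norms = np.sum(X**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)   # clip small negatives from round-off
    np.fill_diagonal(d2, 0.0)  # guarantee K_ii = 1 exactly
    return np.exp(-gamma * d2)


def vendi_score(K):
    """Vendi score: exp of the Shannon entropy of the eigenvalues of K / n."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)  # K is symmetric
    eigvals = eigvals[eigvals > 1e-12]   # drop zero modes (0 * log 0 = 0)
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Illustrative use on random stand-in descriptors (not real MBTR data).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(50, 8))
score = vendi_score(rbf_kernel(descriptors, gamma=0.1))
```

Under this definition the score acts as an effective number of distinct configurations: it equals 1 when all descriptors are identical and approaches n when the n configurations are mutually dissimilar under the chosen kernel.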
Submission Track: Findings, Tools & Open Challenges
Submission Category: AI-Guided Design
Institution Location: Blacksburg, United States of America
Submission Number: 122