Migration as a Probe: A Generalizable Benchmark Framework for Specialist vs. Generalist Machine-Learned Force Fields
Keywords: Machine-learned force fields, Foundation models, Fine-tuning, Benchmarking, 2D materials, Doping engineering, Atomic migration
TL;DR: We benchmark specialist vs. generalist machine-learned force fields in 2D materials and show that atomic migration pathways provide an efficient probe to distinguish model reliability.
Abstract: Machine-learned force fields (MLFFs), particularly pre-trained foundation models, are revolutionizing computational materials science by enabling *ab initio*-level accuracy at the length- and time-scales of classical molecular dynamics (MD). However, their rapid proliferation presents a critical strategic question: Should researchers train bespoke "specialist" models from scratch, fine-tune large "generalist" foundation models, or employ hybrid approaches? The trade-offs in data efficiency, predictive accuracy, computational cost, and susceptibility to out-of-distribution failure remain poorly understood, as does the fundamental question of how different training paradigms affect learned physical representations.
Here, we introduce a systematic benchmarking framework that addresses this question using defect migration pathways, evaluated via nudged elastic band (NEB) calculations, as diagnostic probes that simultaneously test interpolation and extrapolation capabilities (a minimal setup sketch follows the abstract). Using Cr-doped Sb₂Te₃ as a technologically relevant 2D material case study, we benchmark multiple training strategies within the MACE architecture across equilibrium, kinetic (atomic migration), and mechanical (interlayer sliding) properties.
Our key findings reveal that while all models adequately capture equilibrium structures, their predictions for non-equilibrium processes diverge dramatically. Targeted fine-tuning substantially outperforms both from-scratch and zero-shot approaches for kinetic properties, but induces catastrophic forgetting of long-range physics. Critically, analysis of learned representations shows that different training paradigms produce fundamentally distinct, non-overlapping latent-space encodings, suggesting they capture different aspects of the underlying physics.
This work provides practical guidelines for MLFF development and establishes migration-based probes as an efficient, broadly applicable strategy for distinguishing model quality. This approach offers a diagnostic framework that links performance to learned representations, paving the way for more intelligent, uncertainty-aware active-learning strategies.
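To make the migration probe concrete, below is a minimal sketch of a single migration-barrier evaluation using ASE's NEB driver with a pre-trained MACE foundation-model calculator (`mace_mp`). The endpoint file names, choice of the "medium" foundation model, number of intermediate images, and convergence threshold are illustrative assumptions, not the settings used in the paper.

```python
# Minimal sketch: probing a defect migration barrier with a pre-trained
# MACE foundation model via ASE's nudged elastic band (NEB) driver.
# Assumes "initial.xyz"/"final.xyz" hold relaxed endpoint structures
# for the migrating atom (hypothetical file names).
from ase.io import read
from ase.neb import NEB          # ase.mep.NEB on ASE >= 3.23
from ase.optimize import BFGS
from mace.calculators import mace_mp

initial = read("initial.xyz")    # relaxed structure before migration
final = read("final.xyz")        # relaxed structure after migration

# Build the band: endpoints plus 5 intermediate images (illustrative count).
images = [initial] + [initial.copy() for _ in range(5)] + [final]
for image in images:
    # Each image gets its own calculator instance (avoids shared-calculator
    # restrictions in ASE's NEB implementation).
    image.calc = mace_mp(model="medium", default_dtype="float64")

neb = NEB(images, climb=True)    # climbing-image NEB to locate the saddle point
neb.interpolate(method="idpp")   # image-dependent pair-potential initial path

BFGS(neb).run(fmax=0.05)         # relax the band; fmax in eV/Å, illustrative

# Barrier estimate: highest-energy image relative to the initial state.
energies = [img.get_potential_energy() for img in images]
barrier = max(energies) - energies[0]
print(f"Migration barrier ≈ {barrier:.3f} eV")
```

Comparing barriers obtained this way across specialist, fine-tuned, and zero-shot models against a reference (e.g., DFT-NEB) is the kind of low-cost, non-equilibrium diagnostic the framework proposes.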
Submission Track: Benchmarking in AI for Materials Design - Full Paper
Submission Category: AI-Guided Design
Institution Location: Baltimore, United States
AI4Mat Journal Track: Yes
AI4Mat RLSF: Yes
Submission Number: 26