Forces are not Enough: Benchmark and Critical Evaluation for Machine Learning Force Fields with Molecular Simulations
Abstract: Molecular dynamics (MD) simulation techniques are widely used for various natural science applications. Increasingly, machine learning (ML) force field (FF) models have begun to replace ab-initio simulations by predicting forces directly from atomic structures. Despite significant progress in this area, such techniques are primarily benchmarked by their force/energy prediction errors, even though the practical use case is to produce realistic MD trajectories. We aim to fill this gap by introducing a novel benchmark suite for ML MD simulation. We curate representative MD systems, including water, organic molecules, peptides, and materials, and design evaluation metrics corresponding to the scientific objectives of the respective systems. We benchmark a collection of state-of-the-art (SOTA) ML FF models and illustrate, in particular, how the commonly benchmarked force accuracy is not well aligned with relevant simulation metrics. We demonstrate when and how selected SOTA methods fail, and offer directions for further improvement. Specifically, we identify stability as a key metric for ML models to improve. Our benchmark suite comes with a comprehensive open-source codebase for training and simulation with ML FFs to facilitate future work.
Certifications: Survey Certification
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Camera Ready:
- We improved the alanine dipeptide evaluation protocol to use constrained dynamics when simulating with ML force fields and report updated results. We thank Adrian Roitberg and the Roitberg Research Group for their valuable suggestions.
- We included OPLS results for all MD17 molecules benchmarked.

We thank all reviewers for their helpful feedback. All changes are colored in brown in the updated draft.
- We moved the metrics section to the main paper and revised the writing to better explain and motivate the selection of metrics.
- We discuss how we design the stability metrics and select the stability thresholds, with experimental results shown in Figure 9.
- We restructured the introduction and preliminaries to better introduce the problem and motivate the benchmark, with an illustrative figure in Figure 1.
- We conducted an experiment that simulated the Aspirin molecule with a classical force field and reported the results in Appendix A, Figure 7. We compared it to the reference quantum-mechanical simulation and demonstrated that ML force fields can recover quantum accuracy better than classical force fields.
- We added an introduction to the symmetry principles of ML force fields, with an illustration in Figure 10.
- We extended the discussion of previous works and how our work differs from them.
- We added more details on dataset generation, the selection of benchmarked models, and a discussion of the stability issue.
Supplementary Material: zip
Assigned Action Editor: ~Stephan_M_Mandt1
Submission Number: 798