On the Importance of Pretraining Data Alignment for Atomic Property Prediction

TMLR Paper5265 Authors

02 Jul 2025 (modified: 05 Jul 2025), under review for TMLR, CC BY 4.0
Abstract: This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected task-aligned dataset can match or even surpass large-scale pretraining, while using only 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a simple metric for molecular graphs inspired by the Fréchet Inception Distance in computer vision, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most aligned dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently achieve better performance on downstream tasks than those pretrained on massive, mixed datasets such as JMP. This holds even when the mixed dataset includes the upstream dataset most aligned with the downstream task. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data is poorly aligned with the target task. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
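As a rough illustration of the selection idea described in the abstract, the sketch below computes a Fréchet-style distance between two sets of feature vectors and picks the upstream dataset closest to the downstream one. It is not the paper's CSI definition: the abstract does not specify the feature extractor or the exact formula, so this assumes precomputed feature vectors and, for simplicity, a Gaussian fit with diagonal covariance. The function names `gaussian_stats`, `frechet_distance_diag`, and `most_aligned` are hypothetical.

```python
import math
from statistics import fmean, pvariance


def gaussian_stats(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    columns = list(zip(*features))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances


def frechet_distance_diag(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fit to two feature sets.

    Assumption: covariances are treated as diagonal, so the trace term
    Tr(S_a + S_b - 2 (S_a S_b)^(1/2)) reduces to a sum of
    (sqrt(var_a) - sqrt(var_b))^2 over dimensions.
    """
    mu_a, var_a = gaussian_stats(feats_a)
    mu_b, var_b = gaussian_stats(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term


def most_aligned(upstream, downstream_feats):
    """Name of the upstream dataset with the smallest distance to the downstream features."""
    return min(upstream, key=lambda name: frechet_distance_diag(upstream[name], downstream_feats))
```

Under these assumptions, `most_aligned` takes a dictionary mapping dataset names to lists of feature vectors and returns the name with the smallest distance; a full-covariance version would replace the diagonal trace term with a matrix square root.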
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Anuroop_Sriram1
Submission Number: 5265