Exploring Neural Scaling Laws in Molecular Pretraining with Synthetic Tasks

Published: 17 Jun 2024, Last Modified: 26 Jul 2024 | ICML 2024 AI4Science Workshop Poster | CC BY 4.0
Keywords: Synthetic Pretraining, Neural Scaling Laws, Foundation Models, Molecular Representation Learning
TL;DR: We explore the impact of synthetic pre-training on molecular models and how pre-training metrics correlate with performance across a range of downstream tasks.
Abstract: Acquiring large-scale annotated molecular datasets has been one of the main bottlenecks in scaling foundation models for chemistry applications. Proprietary restrictions, cost, and complex experimental setups have left the labeled datasets for some predictive tasks restrictively small. Unlike in language models, where pre-training yields significant improvements on downstream tasks, molecular pre-training has yet to demonstrate a comparable impact. Inspired by the success of neural scaling laws, we explore how synthetic pre-training of molecular machine learning models affects their downstream performance. We pre-train models of three sizes (12M, 45M, 230M parameters) on 12 synthetic tasks in an encoder-decoder setup and evaluate them across 10 downstream tasks. Our findings reveal a general correlation between pre-training perplexity and downstream task performance, with the strength of this relationship varying across tasks. These insights suggest that pre-training metrics can provide useful estimates of model performance after fine-tuning. We pre-train models on over 10B tokens and observe that they saturate, indicating potential for further parameter scaling. This study represents a preliminary exploration of synthetic-task pre-training for molecular models, which can complement other pre-training methods such as multi-task learning with labeled data and multimodal learning.
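The scaling-law analysis described in the abstract can be illustrated with a minimal sketch (not the authors' code; all numbers are hypothetical placeholders for the three model sizes): fit a saturating power law to pre-training loss versus parameter count, and check how pre-training perplexity tracks a downstream metric via a rank correlation.

```python
# Hypothetical illustration of the analysis described in the abstract.
# Loss and metric values below are made up, not results from the paper.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import spearmanr

params_m = np.array([12.0, 45.0, 230.0])          # model sizes in millions of parameters
pretrain_loss = np.array([1.82, 1.61, 1.45])      # cross-entropy per token (placeholder values)
downstream_metric = np.array([0.71, 0.76, 0.80])  # e.g. AUROC on one downstream task (placeholder)

def power_law(n, a, b, c):
    """Saturating power law L(N) = a * N^(-b) + c."""
    return a * n ** (-b) + c

# Fit the scaling curve to pre-training loss vs. model size.
(a, b, c), _ = curve_fit(power_law, params_m, pretrain_loss, p0=(2.0, 0.3, 1.0), maxfev=10000)
print(f"fitted scaling law: L(N) = {a:.2f} * N^(-{b:.3f}) + {c:.2f}")

# Perplexity is exp(loss); rank-correlate it with downstream performance.
perplexity = np.exp(pretrain_loss)
rho, p = spearmanr(perplexity, downstream_metric)
print(f"Spearman rho(perplexity, downstream) = {rho:.2f} (p = {p:.3f})")
```

With more model sizes and checkpoints, the same fit can be repeated per downstream task to see where the perplexity-performance correlation holds and where it breaks down.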
Submission Number: 237