PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes

ACL ARR 2024 June Submission 4178 Authors

16 Jun 2024 (modified: 09 Aug 2024) · ACL ARR 2024 June Submission · Readers: Everyone · License: CC BY 4.0
Abstract: Multimodal Large Language Models (MLLMs) have seen growing adoption across scientific disciplines. These advances motivate the investigation of molecule-text modeling within synthetic chemistry, a field dedicated to designing and conducting chemical reactions to synthesize new compounds with desired properties and applications. Current approaches, however, often neglect the critical role of multi-molecule graph interaction in understanding chemical reactions, leading to suboptimal performance on synthetic chemistry tasks. This study introduces PRESTO (Progressive Pretraining Enhances Synthetic Chemistry Outcomes), a framework that bridges the molecule-text modality gap through a comprehensive benchmark of pretraining strategies and dataset configurations. It progressively improves MLLMs via cross-modal alignment and multi-graph understanding. Extensive experiments demonstrate that PRESTO achieves competitive results on downstream synthetic chemistry tasks. The code can be found at https://anonymous.4open.science/r/presto.
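The progressive pretraining idea in the abstract — first aligning molecule-graph embeddings with the language model's token space, then training on multi-molecule (reaction) inputs — can be illustrated with a minimal sketch. Everything below (the Projector module, the two-stage split, the learning rates, and the use of a Hugging Face-style causal LM) is a hypothetical reconstruction for illustration only, not the authors' released implementation; see the linked repository for the actual code.

```python
# Hypothetical two-stage progressive pretraining loop (illustrative only).
# Assumes `llm` is a Hugging Face-style causal LM accepting `inputs_embeds`
# and `labels`; module names and the stage split are assumptions.
import torch
import torch.nn as nn


class Projector(nn.Module):
    """Maps graph-encoder features into the LLM's embedding space."""

    def __init__(self, graph_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(graph_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, graph_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(graph_feats)


def train_stage(llm, projector, batches, train_llm: bool):
    # Stage 1 (train_llm=False): only the projector learns, aligning
    # molecule embeddings with text. Stage 2 (train_llm=True): the LLM is
    # also updated, e.g., on multi-molecule reaction sequences.
    for p in llm.parameters():
        p.requires_grad = train_llm
    params = list(projector.parameters())
    if train_llm:
        params += list(llm.parameters())
    opt = torch.optim.AdamW(params, lr=1e-4 if not train_llm else 2e-5)

    for mol_feats, text_embeds, labels in batches:
        mol_tokens = projector(mol_feats)  # graph features -> soft tokens
        inputs = torch.cat([mol_tokens, text_embeds], dim=1)
        # Mask the molecule-token positions out of the LM loss.
        pad = torch.full(mol_tokens.shape[:2], -100, dtype=labels.dtype)
        loss = llm(inputs_embeds=inputs, labels=torch.cat([pad, labels], dim=1)).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Under these assumptions, a run would call `train_stage(..., train_llm=False)` on alignment data first, then `train_stage(..., train_llm=True)` on reaction-level multi-molecule data.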
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Reaction Prediction, Retrosynthesis Prediction, Molecule Modeling
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4178