MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization

Published: 24 Sept 2025, Last Modified: 15 Oct 2025 · NeurIPS 2025 AI4Science Poster · CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Open Source Datasets, Molecular Editing, Large Language Models, Reinforcement Learning
TL;DR: We release MEGA, a 31M-pair dataset for molecular editing and show through extensive benchmarking that large language models trained on MEGA with similarity-aware GRPO post-training achieve state-of-the-art performance and unmatched data efficiency.
Abstract: Large language models show strong potential for molecular editing, but progress has been constrained by the limited scale and quality of available training data. To address this, we introduce MEGA, a large-scale dataset of 31.4 million molecule pairs, where each pair represents a single property-improving chemical edit annotated with an explicit action: Replace, Insert, or Delete. We demonstrate MEGA’s utility in a controlled supervised fine-tuning (SFT) setting, where a model trained on MEGA outperforms models trained on existing datasets by up to +21.47 percentage points in hit ratio. Furthermore, we show that Group Relative Policy Optimization (GRPO) post-training with a similarity-aware reward achieves state-of-the-art performance and a remarkable ∼36× improvement in data efficiency, while also preserving edit locality. We release MEGA as an open-access resource to the community to enable data-centric benchmarks and accelerate progress in molecular editing with generative models.
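To make the "similarity-aware reward" concrete, the sketch below shows one way such a reward could gate GRPO rollouts: a generated edit earns full reward only if it improves the target property while staying chemically close (Tanimoto similarity on Morgan fingerprints) to the source molecule, which encourages edit locality. This is a minimal illustrative sketch, not the paper's reward definition; the `property_fn` callable, the 0.4 similarity threshold, and the partial-credit scheme are assumptions.

```python
# Hypothetical similarity-aware reward for GRPO post-training (illustrative only).
# Assumes RDKit is installed; the exact reward used with MEGA may differ.
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs


def tanimoto_similarity(smiles_a: str, smiles_b: str) -> float:
    """Tanimoto similarity between Morgan fingerprints of two SMILES strings."""
    mol_a, mol_b = Chem.MolFromSmiles(smiles_a), Chem.MolFromSmiles(smiles_b)
    if mol_a is None or mol_b is None:
        return 0.0  # invalid SMILES contributes no similarity
    fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)


def similarity_aware_reward(
    source_smiles: str,
    edited_smiles: str,
    property_fn,                 # assumed callable scoring the property to improve
    sim_threshold: float = 0.4,  # assumed locality threshold
) -> float:
    """Full reward for a valid, property-improving edit that stays local;
    partial credit for improvements that drift too far from the source."""
    if Chem.MolFromSmiles(edited_smiles) is None:
        return 0.0  # invalid generation gets no reward
    improved = property_fn(edited_smiles) > property_fn(source_smiles)
    similarity = tanimoto_similarity(source_smiles, edited_smiles)
    if improved and similarity >= sim_threshold:
        return 1.0
    if improved:
        return 0.5 * similarity  # improved the property but edit is not local
    return 0.0


if __name__ == "__main__":
    # Toy usage with LogP (RDKit's Crippen estimator) as the property to improve.
    from rdkit.Chem import Crippen

    logp = lambda s: Crippen.MolLogP(Chem.MolFromSmiles(s))
    print(similarity_aware_reward("CCO", "CCCO", logp))
```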
Submission Number: 153