MEGA: A Large-Scale Molecular Editing Dataset for Guided-Action Optimization

ICLR 2026 Conference Submission 16703 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Open Source Datasets, Molecular Editing, Large Language Models, Reinforcement Learning
TL;DR: We release MEGA, a 31M-pair dataset for molecular editing and show through extensive benchmarking that large language models trained on MEGA with similarity-aware GRPO post-training achieve state-of-the-art performance and unmatched data efficiency.
Abstract: Large language models show strong potential for molecular editing, but progress has been constrained by the limited scale and quality of available training data. To address this, we introduce MEGA, a family of large-scale datasets comprising 31M molecule pairs, each representing a single property-improving chemical edit annotated with an explicit action: Replace, Insert, or Delete. We demonstrate MEGA's utility in a controlled supervised fine-tuning (SFT) setting, where a model trained on MEGA outperforms models trained on existing datasets by up to +21.47 percentage points in hit ratio. Furthermore, we show that Group Relative Policy Optimization (GRPO) post-training with a similarity-aware reward achieves state-of-the-art performance and a $\sim36\times$ improvement in data efficiency, while also preserving edit locality. We release MEGA as an open-access resource to enable data-centric benchmarks and accelerate progress in molecular editing with generative models.
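The abstract describes GRPO post-training with a similarity-aware reward that preserves edit locality. Below is a minimal sketch of what such a reward could look like; it is not the authors' implementation. It assumes RDKit is available, uses Tanimoto similarity over Morgan fingerprints as the locality measure, and uses QED purely as an illustrative property oracle; the threshold and penalty values are hypothetical.

```python
# Sketch of a similarity-aware reward (illustrative only, not the paper's exact reward).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, QED


def similarity_aware_reward(src_smiles: str, edited_smiles: str,
                            sim_threshold: float = 0.4) -> float:
    """Reward property improvement only when the edit stays local (similar)."""
    src = Chem.MolFromSmiles(src_smiles)
    out = Chem.MolFromSmiles(edited_smiles)
    if src is None or out is None:
        return -1.0  # invalid SMILES from the policy: strong penalty

    # Edit locality: Tanimoto similarity over Morgan fingerprints.
    fp_src = AllChem.GetMorganFingerprintAsBitVect(src, 2, nBits=2048)
    fp_out = AllChem.GetMorganFingerprintAsBitVect(out, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(fp_src, fp_out)
    if sim < sim_threshold:
        return 0.0  # edit drifts too far from the source molecule: no reward

    # Property improvement; QED stands in here for whatever property is optimized.
    return max(0.0, QED.qed(out) - QED.qed(src))


if __name__ == "__main__":
    # Toy example: aspirin edited at a single site (acid -> amide).
    print(similarity_aware_reward("CC(=O)Oc1ccccc1C(=O)O",
                                  "CC(=O)Oc1ccccc1C(=O)N"))
```

In a GRPO setup, a scalar reward of this form would be computed for each sampled edit in a group and the group-relative advantages derived from it; the similarity gate is what ties the reward to edit locality.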
Primary Area: datasets and benchmarks
Submission Number: 16703