Keywords: Large Language Model, Molecular Optimization, LLM Reasoning
TL;DR: We propose DePO to train language models to optimize molecules, using reference molecules as guidance.
Abstract: Large language models (LLMs) have demonstrated impressive mathematical reasoning capabilities when trained with reinforcement learning with verifiable rewards (RLVR), particularly through Group Relative Policy Optimization (GRPO). However, extending these methods to scientific domains such as $\textit{molecular optimization}$ is challenging, as LLMs often lack the necessary domain-specific reasoning skills. Molecular optimization requires improving molecular properties while preserving structural similarity, leading to a complex combinatorial search. Existing models struggle due to conflicting objectives, limited chemical reasoning, and the scarcity of datasets with intermediate reasoning steps, which hinders learning effective strategies. To address these issues, we introduce $\textbf{De}$monstration-guided $\textbf{P}$olicy $\textbf{O}$ptimization (DePO), a framework that leverages reference molecules as demonstrations to guide model exploration toward promising regions of chemical space. Specifically, DePO incorporates demonstrations as supervised signals for each reasoning chain to regularize the search direction while preserving the model's reasoning capabilities. Experiments show that DePO significantly outperforms both supervised fine-tuning and GRPO approaches across key molecular optimization metrics and excels at balancing the competing optimization objectives. DePO also exhibits generalization capabilities and inference-time scaling properties.
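The abstract describes combining a GRPO-style group-relative policy gradient with a supervised signal from a reference-molecule demonstration. The following is a minimal sketch of how such a combined loss could look; the function name `depo_loss`, the weighting scheme, and all tensor shapes are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: GRPO-style group-relative policy gradient term
# plus a supervised term on a reference-molecule demonstration.
# Names and the exact weighting are assumptions, not the paper's code.
import torch

def depo_loss(sample_logprobs: torch.Tensor,   # (G,) summed token log-probs per sampled completion
              rewards: torch.Tensor,           # (G,) verifiable rewards (e.g., property + similarity)
              demo_logprob: torch.Tensor,      # ()  summed token log-probs of the demonstration
              demo_weight: float = 0.5) -> torch.Tensor:
    # Group-relative advantages, as in GRPO: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Policy-gradient surrogate: maximize advantage-weighted log-likelihood.
    pg_loss = -(adv.detach() * sample_logprobs).mean()
    # Demonstration term: supervised signal pulling the policy toward the
    # reference molecule, regularizing the search direction.
    demo_loss = -demo_logprob
    return pg_loss + demo_weight * demo_loss

# Toy usage with dummy tensors standing in for model outputs.
if __name__ == "__main__":
    g = 4
    sample_logprobs = torch.randn(g, requires_grad=True)
    rewards = torch.rand(g)
    demo_logprob = torch.randn((), requires_grad=True)
    loss = depo_loss(sample_logprobs, rewards, demo_logprob)
    loss.backward()
    print(float(loss))
```

Under this reading, the demonstration term acts like a per-group SFT regularizer on top of the RLVR objective, which matches the abstract's claim of constraining exploration without removing the policy-gradient component.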
Submission Number: 49