Learning to Evolve: Open-ended Molecular Optimization with Progress-shaped RL

Xuan Li; Zhanke Zhou; Zongze Li; Jiangchao Yao; Tongliang Liu; Bo Han

Published: 17 Jun 2026, Last Modified: 17 Jun 2026. ICML 2026 AI4Math Workshop Poster. CC BY 4.0.

Keywords: Large Language Model; Self-evolving; Scientific Agent; Molecular Optimization

Abstract: Open-ended molecular optimization requires models that learn reusable refinement strategies from evaluator feedback, rather than merely expanding test-time search. Yet training-free optimization leaves the model unchanged, while outcome-only reinforcement learning with verifiable rewards (RLVR) collapses a $\textit{multi-turn}$ refinement trajectory into a $\textit{single}$ reward, obscuring which edits truly improve molecular quality under validity constraints. To turn molecular evolution trajectories into policy-improving supervision, we introduce $\textit{Learning to Evolve}$ (L2E), a long-horizon RL framework that extends RLVR from verifying final outputs to learning refinement dynamics. Instead of assigning one outcome reward to an entire generation, L2E constructs $\textit{turn-specific evolution advantages}$ from validity-gated evaluator feedback, coupling local edit progress with cumulative trajectory utility across sibling refinement chains. This turns intermediate molecular states into learnable supervision, enabling policy optimization over $\textit{how candidates evolve}$, without critic models, reference trajectories, or preference labels. Empirically, L2E delivers strong gains across molecular optimization benchmarks. On TOMG-Bench, it achieves 19.36 RI on LogP and 4.14 RI on MR, outperforming GRPO by 7.8$\times$ and 7.6$\times$, respectively. On multi-property BDP optimization, L2E reaches 4.11 RI on seen instructions and 3.63 RI on unseen ones, improving over the strongest RLVR baselines by 2.8$\times$ and 3.6$\times$. These results show that learning credit over evolution trajectories is a key step toward scalable, policy-improving molecular discovery.
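The abstract does not give the formula behind the turn-specific evolution advantages, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes that each sibling refinement chain has the same number of turns, that "local edit progress" is the score change relative to the last valid molecule, that "cumulative trajectory utility" is the sum of progress from the current turn onward, and that the two are blended and then normalized across siblings at each turn in a GRPO-like way. The function name `evolution_advantages` and the parameters `start_score`, `mix`, and `eps` are hypothetical.

```python
from statistics import mean, pstdev


def evolution_advantages(scores, valid, start_score, mix=0.5, eps=1e-8):
    """Illustrative per-turn advantages for sibling refinement chains.

    scores[k][t]: evaluator score after turn t of chain k.
    valid[k][t]:  True if the molecule at that turn passed validity checks.
    All names and the blending rule are assumptions, not taken from the paper.
    """
    signals = []
    for chain_scores, chain_valid in zip(scores, valid):
        reference = start_score  # score of the last valid molecule
        progress = []
        for score, ok in zip(chain_scores, chain_valid):
            if ok:
                progress.append(score - reference)  # local edit progress
                reference = score
            else:
                progress.append(0.0)  # validity gate: invalid edits earn nothing
        # Cumulative utility: progress summed from this turn to the end.
        to_go, running = [], 0.0
        for gain in reversed(progress):
            running += gain
            to_go.append(running)
        to_go.reverse()
        signals.append(
            [mix * p + (1.0 - mix) * g for p, g in zip(progress, to_go)]
        )

    # Normalize each turn across sibling chains (no critic model needed).
    advantages = [[0.0] * len(chain) for chain in signals]
    for t in range(len(signals[0])):
        column = [chain[t] for chain in signals]
        mu, sigma = mean(column), pstdev(column)
        for k, value in enumerate(column):
            advantages[k][t] = (value - mu) / (sigma + eps)
    return advantages
```

Under these assumptions, a chain that keeps improving receives positive advantages at every turn relative to a sibling that stalls, whereas an outcome-only reward would assign both chains a single scalar for the whole trajectory.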

Submission Number: 83
