MolOpt-Eval: Can Frontier LLMs Perform Structure-Based Hit-to-Lead Optimization?

Chengzhu Li; Haichuan Tan; Wenyu Zhu; Bowen Gao; Jiqing Zheng; Ya-Qin Zhang; Wei-Ying Ma; Yanyan Lan

Published: 28 May 2026, Last Modified: 09 Jun 2026. Venue: GenBio 2026 Poster. License: CC BY 4.0

Keywords: Molecule Optimization, LLM, Evaluation

Abstract: Structure-based molecular optimization, the iterative editing of hit molecules to enhance binding affinity while preserving pocket geometry constraints, is a fundamental task in computer-aided drug discovery. Despite the rapid progress of Large Language Models (LLMs) in scientific reasoning, their efficacy as molecular optimizers remains under-explored. In this work, we first demonstrate through end-to-end experiments that frontier LLMs optimize poorly: affinity gains remain marginal, and high-affinity leads frequently deteriorate upon modification. To diagnose where in the reasoning chain these failures originate, we introduce MolOpt-Eval, a diagnostic benchmark that decomposes structure-based molecular optimization into three independently evaluable cognitive stages (Structural Perception, Strategy Discovery, and Strategy Execution), supplemented by chain-of-thought quality analysis. Evaluating 14 frontier LLMs on 13 diagnostic tasks across 30 DUD-E protein targets, we find that: (i) LLMs achieve high accuracy on 2D molecular structure (micro-F1 = 0.92) and protein fold classification (up to 90%), yet performance on 3D interaction perception drops sharply (F1 < 0.40); (ii) proposed strategies are chemically plausible (over 93%) and produce valid molecules (over 95% SMILES validity), but fewer than 2.5% achieve their intended interactions when validated through Boltz-2 co-folding; (iii) model capability differences vanish on the hardest tasks, suggesting fundamental paradigm limitations rather than model-specific deficiencies; and (iv) distance sensitivity experiments confirm that LLMs process spatial information but cannot translate this awareness into accurate interaction predictions. Our findings identify 3D spatial reasoning as the critical bottleneck and provide actionable guidance for developing structure-aware molecular foundation models.
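For readers unfamiliar with the micro-F1 figures quoted in the abstract, the following is a minimal sketch of how micro-averaged F1 is commonly computed over multi-label predictions. It is not the benchmark's evaluation code: the function name `micro_f1`, the label names, and the example data are all hypothetical, and the abstract does not say which labels the perception tasks use.

```python
from typing import Iterable, Set, Tuple


def micro_f1(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Micro-averaged F1 over (gold, predicted) label-set pairs.

    Counts are pooled across all examples before computing F1, so
    examples with more labels contribute more to the final score.
    """
    tp = fp = fn = 0
    for gold, pred in pairs:
        tp += len(gold & pred)   # labels predicted and present
        fp += len(pred - gold)   # labels predicted but absent
        fn += len(gold - pred)   # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical gold vs. predicted functional-group labels (illustrative only).
examples = [
    ({"amide", "hydroxyl"}, {"amide", "hydroxyl"}),
    ({"amine"}, {"amine", "ester"}),
    ({"halide", "ketone"}, {"ketone"}),
]
score = micro_f1(examples)
print(f"micro-F1 = {score:.2f}")
```

Pooling counts before dividing is what distinguishes the micro average from the macro average, which would instead average per-class F1 scores and weight rare classes equally with common ones.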

Submission Number: 213
