Towards Optimized Use of LLMs in Drug Discovery

07 Sept 2025 (modified: 16 Oct 2025) · Submitted to NeurIPS 2025 2nd Workshop FM4LS · CC BY 4.0
Keywords: Large Language Models, Drug Discovery, Protein-Ligand Optimization, Absolute Binding Free Energy, Fine-Tuning
TL;DR: We make several optimizations to current SOTA methods for LLMs in molecular optimization, focusing on improving the predicted binding free energies of generated compounds.
Abstract: Large language models (LLMs) have recently emerged as a promising tool for small-molecule generation in drug discovery. One notable recent work in this field is MOLLEO, which combines an evolutionary algorithm with an LLM that acts as the operator for performing crossovers and mutations on the ligand population. MOLLEO demonstrates strong results on optimizing molecular docking scores, but several aspects of its design are not well suited to real-world drug discovery. In this work, we make a set of novel optimizations that greatly improve the efficacy of LLMs in small-molecule drug discovery. First, we use molecular dynamics simulations to show that MOLLEO's use of molecular docking as the fitness function yields ligands that are unlikely to bind experimentally. We find that replacing docking with the recently released biomolecular foundation model Boltz-2 greatly improves the binding affinities predicted by molecular dynamics. Second, we incorporate knowledge of existing ligands, which is available in most practical drug discovery scenarios, by using ligands from BindingDB instead of ZINC250k as the starting population for the genetic algorithm. Third, we fine-tune a version of Llama to better modify existing ligands towards higher activity, and find that its use in MOLLEO significantly improves the quality of generated ligands over the base Llama model. We demonstrate our results on the receptor tyrosine kinase c-MET, a crucial protein that drives the growth of various human cancers.
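The MOLLEO-style loop described in the abstract can be sketched as a standard genetic algorithm in which the mutation/crossover operator and the fitness function are pluggable. The sketch below is illustrative only: `llm_propose` and `predicted_affinity` are hypothetical stubs standing in for an LLM call and a Boltz-2 affinity prediction, respectively; neither is the authors' actual implementation.

```python
import random

def predicted_affinity(smiles: str) -> float:
    """Stub fitness function (lower = better). In the paper's setup this
    role is played by Boltz-2 affinity predictions rather than docking;
    here we use a toy deterministic score so the loop is runnable."""
    return -smiles.count("N")

def llm_propose(parent: str) -> str:
    """Stub for the LLM operator that mutates or crosses over a parent
    ligand. A real implementation would prompt an LLM with the parent
    SMILES; here we append a random atom symbol as a placeholder."""
    return parent + random.choice(["C", "N", "O"])

def evolve(population, generations=5, pop_size=4):
    """Evolve a SMILES population: propose children with the LLM operator,
    then keep the pop_size fittest candidates each generation."""
    for _ in range(generations):
        children = [llm_propose(p) for p in population]
        population = sorted(population + children,
                            key=predicted_affinity)[:pop_size]
    return population

random.seed(0)  # deterministic demo
final = evolve(["CCO", "CCN", "c1ccccc1"])
```

In the paper's variants, the starting population would be drawn from known BindingDB ligands for the target (here c-MET) rather than arbitrary molecules, and the operator would be the fine-tuned Llama model.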
Submission Number: 87