Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

11 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Natural Language Processing, Molecule Discovery, LLM Benchmark
Abstract: Recently, Large Language Models (LLMs) have shown great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on a one-to-one mapping, measuring LLMs' ability to retrieve a single, pre-defined answer, rather than their creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose **S**peak-to-**S**tructure (**S$^2$-Bench**), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S$^2$-Bench is specifically designed for one-to-many relationships, challenging LLMs to demonstrate genuine molecular understanding and generation capabilities. Our benchmark includes three key tasks: molecule editing (**MolEdit**), molecule optimization (**MolOpt**), and customized molecule generation (**MolCustom**), each probing a different aspect of molecule discovery. We also introduce **OpenMolIns**, a large-scale instruction tuning dataset that enables Llama-3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S$^2$-Bench. Our comprehensive evaluation of 28 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 4020