Keywords: Large Language Models, Natural Language Processing, Molecule Discovery, LLM Benchmark
Abstract: Recently, Large Language Models (LLMs) have shown great potential in natural language-driven molecule discovery.
However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one mappings: they measure LLMs' ability to retrieve a single, pre-defined answer rather than their creative potential to generate diverse yet equally valid molecular candidates.
To address this critical gap, we propose **S**peak-to-**S**tructure (**S$^2$-Bench**),
the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation.
S$^2$-Bench is specifically designed for one-to-many relationships, challenging LLMs to demonstrate genuine molecular understanding and generation capabilities.
Our benchmark includes three key tasks: molecule editing (**MolEdit**), molecule optimization (**MolOpt**), and customized molecule generation (**MolCustom**), each probing a different aspect of molecule discovery.
We also introduce **OpenMolIns**, a large-scale instruction tuning dataset that enables Llama-3.1-8B to surpass the most powerful LLMs, such as GPT-4o and Claude-3.5, on S$^2$-Bench.
Our comprehensive evaluation of 28 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 4020