Molecular Generation through Reasoning with Large Language Models

ICLR 2026 Conference Submission 17787 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models (LLMs), Molecule Generation, Supervised Fine-Tuning, Reinforcement Learning
TL;DR: We construct chain-of-thought (CoT) data to train reasoning LLMs for molecular generation.
Abstract: Molecule generation holds significant potential for scientific discovery and practical applications, e.g., accelerating drug discovery by directly generating candidate molecules. Recent attempts often frame this task as a \textit{translation} problem from a molecular caption to a structural representation such as SMILES. This paper first examines the feasibility of modeling the task as a reasoning process with large language models (LLMs), generating higher-quality molecules through structural decomposition and recombination within a Chain-of-Thought (CoT). We then introduce a workflow for curating accurate CoT data, incorporating both machine and expert verification. Lastly, we demonstrate that with a limited dataset of $4{,}213$ high-quality samples, namely \textbf{MolCoT4K}, we elicit strong reasoning capabilities for molecule generation in open-source LLMs such as Qwen2.5-7B, achieving state-of-the-art exact match accuracy over strong open-source baselines (e.g., MolT5 and LlaSMol) as well as advanced commercial LLMs like GPT-4o. Moreover, the resulting model, \textbf{MolGeneration}, attains a Pass@16 exact match accuracy of 48.46\%, highlighting its strong potential for real-world experimental applications when supported by a practical external verifier or chemistry experts. Our analysis shows that the correctness of the CoT path is crucial, while reasoning ability primarily enhances accuracy in fine-grained molecule generation. The dataset, model, and training codebase will be released to the community.
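To make the Pass@16 exact-match metric concrete, here is a minimal illustrative sketch (not code from the submission): a molecule counts as solved if any of the $k$ sampled SMILES strings exactly matches the reference. In practice, generated SMILES would first be canonicalized (e.g., with RDKit) before comparison; the helper names below are hypothetical.

```python
def pass_at_k(samples: list[str], reference: str) -> bool:
    """True if any of the k generated SMILES exactly matches the reference.

    Assumes both sides are already canonicalized; raw string comparison
    is used here only for brevity.
    """
    return any(s == reference for s in samples)


def corpus_pass_at_k(generations: list[list[str]], references: list[str]) -> float:
    """Fraction of test molecules solved by at least one of the k samples."""
    hits = sum(pass_at_k(gens, ref) for gens, ref in zip(generations, references))
    return hits / len(references)


# Toy example: 2 test molecules, k = 2 samples each.
# The first reference ("CCO", ethanol) is hit; the second ("CC") is not.
score = corpus_pass_at_k([["CCO", "CCN"], ["c1ccccc1", "C1CCCCC1"]], ["CCO", "CC"])
print(score)  # 0.5
```

An external verifier (or a chemist) plays the role of selecting the correct candidate among the $k$ samples, which is why Pass@16 is a meaningful upper bound for assisted workflows.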
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 17787