Keywords: Large Language Models, Chain-of-Thought, Chemical Benchmark.
TL;DR: ChemCoTBench bridges complex chemical reasoning with arithmetic-inspired step-by-step workflows, enabling LLMs to systematically tackle real-world tasks like molecular optimization and reaction prediction.
Abstract: While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval and neglect the step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations (addition, deletion, and substitution) to formalize chemical problem-solving as transparent, step-by-step workflows. By treating molecular transformations as modular "chemical operations", the framework enables slow-thinking reasoning that mirrors the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while remaining amenable to structured evaluation. We further provide ChemCoTDataset, a 22,000-instance chemical reasoning dataset with expert-annotated chains of thought to facilitate LLM fine-tuning. By providing annotated training data, a reasoning taxonomy, and baseline evaluations, our work bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.
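To make the idea of modular "chemical operations" concrete, the following is a minimal sketch in Python. It is not the ChemCoTBench implementation: the `Molecule` and `Trace` classes, the three function names, and the representation of a molecule as a scaffold name plus a list of functional-group labels are all assumptions made purely for illustration. A real pipeline would operate on actual molecular structures (for example SMILES strings) and check chemical validity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Molecule:
    """Toy stand-in for a molecule (hypothetical): scaffold name plus group labels."""
    scaffold: str
    groups: List[str] = field(default_factory=list)


@dataclass
class Trace:
    """Records every operation so the reasoning chain can be inspected step by step."""
    steps: List[str] = field(default_factory=list)


def add_group(mol: Molecule, group: str, trace: Trace) -> Molecule:
    """Addition: attach a new functional group and log the step."""
    trace.steps.append(f"add {group}")
    return Molecule(mol.scaffold, mol.groups + [group])


def delete_group(mol: Molecule, group: str, trace: Trace) -> Molecule:
    """Deletion: remove one occurrence of a functional group and log the step."""
    if group not in mol.groups:
        raise ValueError(f"{group!r} is not present on {mol.scaffold}")
    remaining = list(mol.groups)
    remaining.remove(group)
    trace.steps.append(f"delete {group}")
    return Molecule(mol.scaffold, remaining)


def substitute_group(mol: Molecule, old: str, new: str, trace: Trace) -> Molecule:
    """Substitution: replace one functional group with another in place."""
    if old not in mol.groups:
        raise ValueError(f"{old!r} is not present on {mol.scaffold}")
    swapped = list(mol.groups)
    swapped[swapped.index(old)] = new
    trace.steps.append(f"substitute {old} -> {new}")
    return Molecule(mol.scaffold, swapped)


# Usage: chain three operations and keep the full step-by-step record.
trace = Trace()
mol = Molecule("benzene", ["methyl", "nitro"])
mol = add_group(mol, "hydroxyl", trace)
mol = delete_group(mol, "methyl", trace)
mol = substitute_group(mol, "nitro", "amino", trace)
print(mol.groups)   # ['amino', 'hydroxyl']
print(trace.steps)  # ['add hydroxyl', 'delete methyl', 'substitute nitro -> amino']
```

Each operation returns a new `Molecule` rather than mutating its input, so every intermediate state in the chain remains available for inspection, which is the property the abstract associates with transparent, step-by-step workflows.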
Croissant File: json
Dataset URL: https://huggingface.co/datasets/OpenMol/ChemCoTBench
Code URL: https://github.com/IDEA-XL/ChemCoTBench/
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Flagged For Ethics Review: true
Submission Number: 566