ChemReason-Bench: Benchmarking Large Language Models for Procedural Reasoning in Experimental Chemistry
Keywords: Experimental procedure reasoning, Organic synthesis protocols, Verifiable benchmark, Multi-task, Schema-constrained outputs
Abstract: Experimental protocols in organic synthesis specify not only the intended transformation but also an executable sequence of operations and conditions. While recent language models demonstrate strong chemistry knowledge, widely used evaluations are only weakly diagnostic of procedure-level decision making. In this setting, correctness requires consistent step ordering, feasibility under the stated conditions, faithful entity-role grounding, and schema-parseable outputs that can be automatically validated against operational constraints. We present ChemReason-Bench, a human-validated benchmark for verifiable experimental-procedure reasoning, built on a structured representation with explicit placeholders and a unified schema that enables automatic checking of many operational constraints. From 500 reactions, we instantiate 7,306 benchmark tasks across six complementary formats: ordering, step validation, condition validation, schema-constrained completion, contrastive choice, and evidence-grounded rationalization. We further release a large-scale instantiation of the same templates, kept disjoint from the evaluation set, for downstream adaptation studies. Using a unified evaluation protocol, we benchmark diverse open-source, proprietary, and domain-specific models and observe clear variation across tasks and model families. In the appendix, we report controlled adaptation experiments: supervised fine-tuning improves small models, preference optimization adds only limited gains in our setting, and a gap to the strongest evaluated systems remains.
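To make the abstract's notion of schema-parseable, automatically checkable outputs concrete, the sketch below shows one possible validator. It is a minimal illustration under assumed conventions, not the benchmark's released schema: the placeholder format ("$1"), the operation vocabulary, and the ordering constraint are all hypothetical.

```python
# A minimal, hypothetical sketch of the kind of automatic check described in the
# abstract. The placeholder format ("$1"), operation vocabulary, and ordering
# rule below are illustrative assumptions, not the benchmark's released schema.
import json
import re

PLACEHOLDER = re.compile(r"^\$\d+$")  # assumed entity-placeholder format, e.g. "$1"
ALLOWED_OPS = {"add", "stir", "heat", "filter", "concentrate"}  # assumed vocabulary

def validate_procedure(raw: str) -> list[str]:
    """Return constraint violations for a model output; an empty list means it passes."""
    try:
        steps = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not schema-parseable JSON"]
    if not isinstance(steps, list):
        return ["top-level output must be a list of steps"]
    errors = []
    for i, step in enumerate(steps):
        op = step.get("operation")
        if op not in ALLOWED_OPS:
            errors.append(f"step {i}: unknown operation {op!r}")
        for ent in step.get("entities", []):
            if not isinstance(ent, str) or not PLACEHOLDER.match(ent):
                errors.append(f"step {i}: entity {ent!r} is not a placeholder")
    # One example of an operational constraint: a mixture cannot be
    # filtered before anything has been added to it.
    ops = [s.get("operation") for s in steps]
    if "filter" in ops:
        first_add = ops.index("add") if "add" in ops else len(ops)
        if ops.index("filter") < first_add:
            errors.append("ordering: 'filter' occurs before any 'add'")
    return errors

if __name__ == "__main__":
    sample = ('[{"operation": "add", "entities": ["$1", "$2"]}, '
              '{"operation": "filter", "entities": ["$1"]}]')
    print(validate_procedure(sample) or "passes all checks")
```

Because every check above is deterministic, a validator in this style can score model outputs without a human or LLM judge, which is the sense in which such tasks are verifiable.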
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, metrics, NLP datasets, fine-tuning
Contribution Types: Data resources
Languages Studied: English
Submission Number: 9381