Keywords: Large language models; chemistry; benchmark; symbolic reasoning; molecule structure; evaluation; general models; domain-specific models; reproducibility; domain-specific evaluation; molecular reasoning
TL;DR: CHEMSETS introduces a reproducible benchmark for chemical reasoning. Across 16 models, general LLMs outperform chemistry-specialized ones, with simple tasks nearly solved but translation and complex reasoning still open.
Abstract: Large Language Models (LLMs) have demonstrated immense versatility
and have been successfully adapted
to tackle numerous problems in scientific domains.
In chemistry, specialized LLMs have recently been developed
for molecule structure tasks such as molecule name conversion,
captioning, text-guided generation, and property or reaction prediction.
However, evaluations of chemistry-focused LLMs remain inconsistent
and often lack rigor:
new models are typically assessed
only on tasks they were explicitly trained for,
while the models they are compared against were trained on different sets of tasks.
In addition, several proposed benchmarks introduce idiosyncratic features,
e.g., task-specific input or output tags,
making LLM performance highly sensitive
to prompting strategies, answer formatting,
and generation parameters,
further complicating reproducible evaluation.
To address these shortcomings,
we perform a standardized and reproducible method comparison
of chemical reasoning models on CHEMSETS,
a flexible benchmark suite integrated into lm-evaluation-harness.
CHEMSETS unifies existing benchmarks with
newly designed symbolically verifiable tasks,
thereby expanding both task diversity and difficulty.
Through this evaluation, we establish a fair leaderboard and provide new insights
into the limitations of recently proposed chemistry-aware LLMs.
We show that current chemistry LLMs exhibit limited generalization
beyond the specific tasks they were trained on.
Remarkably, across chemical tasks,
recent open-weight non-specialist reasoning models
outperform specialist models.
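Since CHEMSETS is integrated into lm-evaluation-harness, an evaluation run can go through the harness's standard Python entry point. The sketch below is a minimal illustration, assuming a hypothetical subtask name and a placeholder model checkpoint; neither is an identifier confirmed by the paper.

```python
# Minimal sketch: evaluating one model on a CHEMSETS task via lm-evaluation-harness.
# "chemsets_name_conversion" and the checkpoint are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # placeholder checkpoint
    tasks=["chemsets_name_conversion"],                # hypothetical CHEMSETS subtask name
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics reported by the harness
```

Routing every model through the same harness configuration keeps prompting, answer parsing, and generation parameters fixed across systems, which is the reproducibility concern the abstract raises.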
Submission Track: Benchmarking in AI for Materials Design - Short Paper
Submission Category: AI-Guided Design
Institution Location: Linz, Austria
Submission Number: 131