Keywords: Large language models; chemistry; benchmark; symbolic reasoning; molecule structure; evaluation; general models; domain-specific models; reproducibility; domain-specific evaluation; molecular reasoning
TL;DR: CHEMSETS introduces a reproducible benchmark for chemical reasoning. Across 16 models, general LLMs outperform chemistry-specialized ones, with simple tasks nearly solved but translation and complex reasoning still open.
Abstract: Large Language Models (LLMs) have demonstrated immense versatility
and have been successfully adapted
to tackle numerous problems in scientific domains.
In chemistry, specialized LLMs have recently been developed
for molecule structure tasks such as molecule name conversion,
captioning, text-guided generation, and property or reaction prediction.
However, evaluations of chemistry-focused LLMs remain inconsistent
and often lack rigor:
new models are typically assessed
only on tasks they were explicitly trained for,
while the models they are compared against were trained on different sets of tasks.
In addition, several proposed benchmarks introduce idiosyncratic features,
e.g., task-specific input or output tags,
making LLM performance highly sensitive
to prompting strategies, answer formatting,
and generation parameters,
further complicating reproducible evaluation.
To address these shortcomings,
we perform a standardized and reproducible method comparison
of chemical reasoning models on CHEMSETS,
a flexible benchmark suite integrated into lm-evaluation-harness.
CHEMSETS unifies existing benchmarks with
newly designed symbolically verifiable tasks,
thereby expanding both task diversity and difficulty.
Through this evaluation, we establish a fair leaderboard and provide new insights
into the limitations of recently proposed chemistry-aware LLMs.
We show that current chemistry LLMs exhibit limited generalization
beyond the specific tasks they were trained on.
Remarkably, across chemical tasks,
recent open-weight non-specialist reasoning models
outperform specialist models.
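Since CHEMSETS is integrated into lm-evaluation-harness, an evaluation run can go through the harness's standard Python entry point. The sketch below is a minimal illustration, assuming a hypothetical subtask name and a placeholder model checkpoint; neither is an identifier confirmed by the paper.

```python
# Minimal sketch: evaluating one model on a CHEMSETS task via lm-evaluation-harness.
# "chemsets_name_conversion" and the checkpoint are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # placeholder checkpoint
    tasks=["chemsets_name_conversion"],                # hypothetical CHEMSETS subtask name
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics reported by the harness
```

Routing every model through the same harness configuration keeps prompting, answer parsing, and generation parameters fixed across systems, which is the reproducibility concern the abstract raises.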
Submission Track: Benchmarking in AI for Materials Design - Short Paper
Submission Category: AI-Guided Design
Institution Location: Linz, Austria
Submission Number: 131