ChemEval: A Multi-level and Fine-grained Chemical Capability Evaluation for Large Language Models

Yuqing Huang; Rongyang Zhang; Xuesong He; Xuyang Zhi; Hao Wang; Nuo Chen; Liuzongbo; Xin Li; Feiyang Xu; Deguang Liu; Huadong Liang; YiLi; Jian Cui; Yin Xu; Shijin Wang; Guiquan Liu; Qi Liu; Defu Lian; Enhong Chen

12 May 2025 (modified: 30 Oct 2025). Submitted to NeurIPS 2025 Datasets and Benchmarks Track. License: CC BY-NC-SA 4.0

Keywords: Large Language Models, Benchmark, Chemical Knowledge Inference

Abstract: The emergence of Large Language Models (LLMs) in chemistry marks a significant advance in applying artificial intelligence to the chemical sciences. While these models show promise, applying them effectively in chemistry demands evaluation protocols that address the field's inherent complexities. To bridge this gap, we introduce ChemEval, a hierarchical assessment framework designed to evaluate LLMs' capabilities across chemical domains. Our methodology uses a four-tier progression, spanning from basic chemical concepts to advanced theoretical principles. Sixty-two textual and multimodal tasks enable fine-grained analysis of model capabilities and comprehensive evaluation via carefully crafted assessment protocols. The framework integrates curated open-source datasets with expert-validated materials, ensuring both practical relevance and scientific rigor. In our experiments, we evaluate mainstream LLMs in both zero-shot and few-shot settings, using carefully designed examples and prompts. Results indicate that general-purpose LLMs, while proficient in understanding chemical literature and following instructions, struggle with tasks requiring deep chemical expertise. In contrast, chemistry-specific LLMs perform better on technical tasks but show limitations in general language processing. These findings highlight both the current limitations of and future opportunities for LLMs in chemistry. Our research provides a systematic framework for advancing the application of artificial intelligence in chemical research, potentially facilitating new discoveries in the field.
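The difference between the zero-shot and few-shot settings mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the ChemEval implementation: the `Example` class, the `build_prompt` function, the prompt layout, and the sample chemistry questions are all assumptions made for illustration and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Example:
    """One solved demonstration: a question and its reference answer (hypothetical type)."""
    question: str
    answer: str


def build_prompt(instruction: str, question: str, shots: Sequence[Example] = ()) -> str:
    """Assemble a zero-shot prompt (no shots) or a few-shot prompt (with shots)."""
    parts = [instruction.strip()]
    # Each demonstration is shown with its answer filled in.
    for shot in shots:
        parts.append(f"Question: {shot.question}\nAnswer: {shot.answer}")
    # The final question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical task text, not taken from the ChemEval dataset.
INSTRUCTION = "Give the molecular formula of the named compound."
SHOTS = [Example("water", "H2O"), Example("methane", "CH4")]

zero_shot = build_prompt(INSTRUCTION, "ethanol")
few_shot = build_prompt(INSTRUCTION, "ethanol", SHOTS)
```

In this sketch the only difference between the two settings is whether solved demonstrations precede the target question, so the same scoring code could be applied to both kinds of model output.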

Croissant File: json

Dataset URL: https://huggingface.co/datasets/Ooo1/ChemEval

Code URL: https://github.com/USTC-StarTeam/ChemEval

Primary Area: AI/ML Datasets & Benchmarks for life sciences (e.g. climate, health, life sciences, physics, social sciences)

Submission Number: 2454
