Additional Submission Instructions: For the camera-ready version, please include the author names and affiliations, funding disclosures, and acknowledgements.
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Large language models; Scientific benchmark; Multi-level knowledge evaluation
TL;DR: We propose the SciKnowEval benchmark, a novel framework that systematically evaluates LLMs across five progressive levels of scientific knowledge.
Abstract: Large language models (LLMs) are playing an increasingly important role in scientific research, yet there remains a lack of comprehensive benchmarks to evaluate the breadth and depth of scientific knowledge embedded in these models. To address this gap, we introduce SciKnowEval, a large-scale dataset designed to systematically assess LLMs across five progressive levels of scientific understanding: memory, comprehension, reasoning, discernment, and application. SciKnowEval comprises 28K multi-level questions and solutions spanning biology, chemistry, physics, and materials science. Using this benchmark, we evaluate 20 leading open-source and proprietary LLMs. The results show that while proprietary models often achieve state-of-the-art performance, substantial challenges remain, particularly in scientific reasoning and real-world application. We envision SciKnowEval as a standard benchmark for evaluating scientific capabilities in LLMs and as a catalyst for advancing more capable and reliable scientific language models.
Submission Number: 59