Additional Submission Instructions: For the camera-ready version, please include the author names and affiliations, funding disclosures, and acknowledgements.
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Large language models; Scientific benchmark; Multi-level knowledge evaluation
TL;DR: We propose the SciKnowEval benchmark, a novel framework that systematically evaluates LLMs across five progressive levels of scientific knowledge.
Abstract: Large language models (LLMs) are playing an increasingly important role in scientific research, yet there remains a lack of comprehensive benchmarks to evaluate the breadth and depth of scientific knowledge embedded in these models. To address this gap, we introduce SciKnowEval, a large-scale dataset designed to systematically assess LLMs across five progressive levels of scientific understanding: memory, comprehension, reasoning, discernment, and application. SciKnowEval comprises 28K multi-level questions and solutions spanning biology, chemistry, physics, and materials science. Using this benchmark, we evaluate 20 leading open-source and proprietary LLMs. The results show that while proprietary models often achieve state-of-the-art performance, substantial challenges remain, particularly in scientific reasoning and real-world application. We envision SciKnowEval as a standard benchmark for evaluating scientific capabilities in LLMs and as a catalyst for advancing more capable and reliable scientific language models.
Submission Number: 59