SCIBENCH: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Published: 28 Oct 2023, Last Modified: 16 Nov 2023 · MATH-AI 23 Poster
Keywords: Large Language Model Benchmark
Abstract: Recent advances in Large Language Models (LLMs) have demonstrated notable progress on many mathematical benchmarks. However, most of these benchmarks contain only problems grounded in junior and senior high school subjects, feature only multiple-choice questions, and are confined to a limited scope of elementary arithmetic operations. To address these issues, this paper introduces SciBench, an expansive benchmark suite that aims to systematically examine the reasoning capabilities required for solving complex scientific problems. SciBench contains two datasets: an open set featuring a range of collegiate-level scientific problems, and a closed set comprising problems from undergraduate-level exams. Using these two datasets, we conduct an in-depth benchmarking study of five representative LLMs with various prompting strategies. Furthermore, through a detailed user study, we show that no single prompting strategy significantly outperforms the others, and that strategies which improve certain problem-solving skills can cause declines in others.
Submission Number: 44