Evaluating Large Language Models in Olympic-Level Physics Problems: A Benchmark Dataset

ACL ARR 2024 April Submission 562 Authors

16 Apr 2024 (modified: 21 May 2024) · ACL ARR 2024 April Submission · License: CC BY 4.0
Abstract: Large Language Models (LLMs) and Large Multimodal Models (LMMs) have demonstrated notable capabilities across a wide range of tasks and domains, with problem-solving skills that span natural language understanding and generation as well as complex decision making. However, their proficiency in solving mathematical physics problems remains relatively underexplored. In this paper, we propose PhoPile, a high-quality, multimodal, physics-specific, Olympic-level physics dataset. We detail the process of data collection, cleaning, and structuring used to ensure the dataset's integrity and utility. We then conduct a fine-grained evaluation of currently popular LLMs and LMMs on the dataset, providing a benchmark of their physics problem-solving capability and broadening the options for assessing models' competence in the natural sciences. We also introduce an evaluation method that enables more detailed measurement of a model's reasoning capabilities. To our knowledge, our research is the first attempt to reveal the potential and current limitations of these models in interpreting and solving complex physics challenges, setting a foundational baseline for subsequent advances in this field.
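The submission page does not include the benchmark's evaluation code. As a rough illustration of the kind of final-answer scoring a physics benchmark like this might use, here is a minimal Python sketch; the function names (`extract_final_number`, `is_correct`) and the 1% relative tolerance are assumptions for illustration, not the authors' published method, which the abstract says also measures reasoning in more detail.

```python
import re

def extract_final_number(text: str) -> float | None:
    """Pull the last number (including scientific notation) out of a free-form answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?", text)
    return float(matches[-1]) if matches else None

def is_correct(model_answer: str, reference: float, rel_tol: float = 0.01) -> bool:
    """Mark a numeric answer correct if it is within a relative tolerance of the reference."""
    value = extract_final_number(model_answer)
    if value is None:
        return False
    if reference == 0:
        return abs(value) <= rel_tol  # fall back to an absolute tolerance at zero
    return abs(value - reference) / abs(reference) <= rel_tol

# A model's free-form response scored against a reference value of 9.81:
print(is_correct("Balancing forces gives g. Final answer: 9.80", 9.81))  # True
```

A tolerance-based match like this only grades the final number; a step-level evaluation of the reasoning chain, as the abstract describes, would require additional per-step annotations.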
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, Large Multimodal Models, Dataset, Physics
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 562