PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems
Abstract: The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some work on solving physics problems, a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a range of inference-time techniques and agentic frameworks to improve the models' performance, including the cumulative verification of proposed solutions by other, smaller LLM agents, and we present a comparative analysis of the improvements each technique yields. The multi-agent framework produces significant gains on problems that the models initially perform poorly on. Furthermore, we introduce PhysicsEval, a new evaluation benchmark of 19,609 physics problems sourced from various physics textbooks, with their corresponding correct solutions scraped from physics forums and educational websites.
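For concreteness, the cumulative verification by smaller LLM agents described in the abstract could be organized roughly as in the following minimal Python sketch. This is an illustrative assumption, not the paper's actual implementation: `query_model` is a hypothetical stand-in for an LLM API call, and the model names, prompts, and two-verifier default are placeholders.

```python
from typing import Sequence


def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; swap in a real client."""
    raise NotImplementedError


def solve_with_cumulative_verification(
    problem: str,
    solver: str = "frontier-llm",  # placeholder name for the main model
    verifiers: Sequence[str] = ("small-agent-1", "small-agent-2"),  # placeholders
) -> str:
    # 1. The frontier model proposes an initial step-by-step solution.
    solution = query_model(
        solver, f"Solve this physics problem step by step:\n{problem}"
    )

    # 2. Smaller verifier agents critique the solution in turn; each agent
    #    also sees the critiques of the agents before it, so the feedback
    #    accumulates rather than being produced independently.
    critiques: list[str] = []
    for verifier in verifiers:
        prior = "\n".join(critiques) or "None so far."
        critiques.append(
            query_model(
                verifier,
                f"Problem:\n{problem}\n\nProposed solution:\n{solution}\n\n"
                f"Earlier critiques:\n{prior}\n\n"
                "Identify any physical or mathematical errors.",
            )
        )

    # 3. The solver revises its answer using the accumulated feedback.
    all_critiques = "\n".join(critiques)
    return query_model(
        solver,
        f"Problem:\n{problem}\n\nYour earlier solution:\n{solution}\n\n"
        f"Reviewer critiques:\n{all_critiques}\n\n"
        "Produce a corrected final solution.",
    )
```

Passing the earlier critiques to each subsequent verifier is what would make the verification cumulative rather than a set of independent reviews.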
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Reasoning, Physics, Problem-solving, Large Language Models, Evaluation, Benchmark Dataset, NLP Applications
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=PHFai3Mv0A
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
Justification For Not Keeping Action Editor Or Reviewers: N/A
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: The dataset sources have been listed in Appendix C.3.
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: The dataset split has been mentioned in Section 3.
C Computational Experiments: Yes
C1 Model Size And Budget: N/A
C2 Experimental Setup And Hyperparameters: N/A
C3 Descriptive Statistics: Yes
C3 Elaboration: It has been discussed in Section 4.
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: AI assistants were used only to improve the clarity of the language.
Author Submission Checklist: Yes
Submission Number: 1463