Keywords: LLM, Evaluation, Reasoning
Abstract: Benchmark saturation and data contamination make it increasingly difficult to measure genuine scientific reasoning in frontier LLMs.
We introduce \textsc{Science Arena}, an olympiad-style evaluation suite built from five rubric-driven scientific competitions whose official materials were publicly released in August and October 2025 (including IBO 2023, whose materials were released in 2025), forming a practical time-release holdout against pretraining overlap.
To approximate real exam conditions, olympiad medalists grade model solutions with stepwise partial-credit rubrics, and we map scores to medal-calibrated human baselines.
Across Physics, Chemistry, and Biology theory exams, the best model reaches or exceeds the Human-Gold tier. It even surpasses the Human-Winner baseline on IBO'23, whose constrained answer format makes scoring more objective, but it remains below Human-Winner performance on the open-ended IPhO and IChO exams, where long-form derivations and partial credit dominate.
We further show that LLM-as-a-judge grading can substantially inflate scores on open-ended chemistry exams, motivating expert, rubric-faithful evaluation.
Finally, we provide an expert-derived cross-domain ability boundary that pinpoints remaining bottlenecks, such as visual grounding and long-horizon coherence.
(Note: the abstract in the submitted PDF contains a minor inconsistency regarding the biology exam(s); the main text is correct, and the abstract above reflects the corrected wording.)
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: corpus creation, benchmarking, NLP datasets, evaluation methodologies, evaluation, metrics, reproducibility, statistical testing for evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese
Submission Number: 10508