Keywords: LLM, Evaluation, Reasoning
Abstract: Benchmark saturation and data contamination make it increasingly difficult to measure genuine scientific reasoning in frontier LLMs.
We introduce \textsc{Science Arena}, an olympiad-style evaluation suite built from five rubric-driven scientific competitions whose official materials were publicly released in August and October 2025 (including IBO 2023, whose materials were released in 2025), forming a practical time-release holdout against pretraining overlap.
To approximate real exam conditions, olympiad medalists grade model solutions with stepwise partial-credit rubrics, and we map scores to medal-calibrated human baselines.
Across Physics, Chemistry, and Biology theory exams, the best model reaches or exceeds the Human-Gold tier. It even surpasses the Human-Winner baseline on IBO'23, whose constrained answer format makes scoring more objective, but it remains below Human-Winner performance on the open-ended IPhO and IChO exams, where long-form derivations and partial credit dominate.
We further show that LLM-as-a-judge grading can substantially inflate scores on open-ended chemistry exams, motivating expert, rubric-faithful evaluation.
Finally, we provide an expert-derived cross-domain ability boundary that pinpoints remaining bottlenecks, such as visual grounding and long-horizon coherence.
(Note: the abstract in the submitted PDF contains a minor inconsistency regarding the biology exam(s); the main text is correct, and the abstract above reflects the corrected wording.)
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: corpus creation, benchmarking, NLP datasets, evaluation methodologies, evaluation, metrics, reproducibility, statistical testing for evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese
Submission Number: 10508