Keywords: Large Language Model, Benchmark
TL;DR: We introduce a comprehensive, annually updated benchmark and reveal a unique finding: high scores on the Gaokao do not reflect human-aligned capabilities in LLMs.
Abstract: Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may “game” these benchmarks due to data leakage, achieving high scores while struggling with tasks straightforward for humans.
To address this problem substantively, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct closed-book evaluations of representative models released prior to the exam.
Contrary to the prevailing consensus, even after addressing data leakage and coverage, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across questions of varying difficulty, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers and recurring mistake patterns. We find that these phenomena are well grounded in the motivations behind OpenAI's o1, and that o1's reasoning-as-difficulties approach can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks and highlight the need for more LLM-aligned difficulty analysis.
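To make the Rasch-model analysis concrete, here is a minimal illustrative sketch (not the paper's actual code). It uses the standard Rasch (1PL IRT) item-response function, under which a human-like respondent's accuracy falls smoothly as item difficulty rises; the `flat_curve` values are hypothetical LLM accuracies, invented here only to illustrate the "anomalously consistent performance across difficulties" discrepancy described above.

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Rasch (1PL IRT) probability that a respondent of ability
    `theta` answers an item of difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Under the Rasch model, expected accuracy declines as difficulty rises:
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
human_curve = [rasch_prob(theta=0.5, b=b) for b in difficulties]
# e.g. ~0.92 on the easiest item down to ~0.18 on the hardest

# A model with "anomalously consistent" performance would instead show a
# near-flat accuracy profile across the same difficulties
# (hypothetical numbers for illustration):
flat_curve = [0.70, 0.68, 0.71, 0.69, 0.70]

def difficulty_sensitivity(accs: list) -> float:
    """Spread of accuracy across difficulty levels; a small spread
    signals difficulty-insensitive (non-human-like) scoring."""
    return max(accs) - min(accs)
```

A simple diagnostic, then, is to compare `difficulty_sensitivity(human_curve)` (large) with `difficulty_sensitivity(flat_curve)` (small); the second discrepancy, high variance on same-difficulty questions, would analogously show up as large accuracy spread *within* one difficulty bin.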
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4217