Keywords: Large Language Model, Benchmark
TL;DR: We introduce a comprehensive, annually updated benchmark and reveal a unique finding: high scores on the Gaokao do not reflect human-aligned capabilities in LLMs.
Abstract: Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may “game” these benchmarks due to data leakage, achieving high scores while struggling with tasks straightforward for humans.
To address this problem substantively, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct closed-book evaluations of representative models released prior to the exam.
Contrary to the prevailing consensus, even after addressing data leakage and coverage, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across questions of varying difficulty, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers and recurring mistake patterns. We find that these phenomena are well grounded in the motivations behind OpenAI's o1, and that o1's reasoning-as-difficulties approach can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks and highlight the need for more LLM-aligned difficulty analysis.
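To make the Rasch-model analysis concrete, here is a minimal illustrative sketch (not the paper's actual code). It uses the standard Rasch (1PL IRT) item-response function, under which a human-like respondent's accuracy falls smoothly as item difficulty rises; the `flat_curve` values are hypothetical LLM accuracies, invented here only to illustrate the "anomalously consistent performance across difficulties" discrepancy described above.

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Rasch (1PL IRT) probability that a respondent of ability
    `theta` answers an item of difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Under the Rasch model, expected accuracy declines as difficulty rises:
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
human_curve = [rasch_prob(theta=0.5, b=b) for b in difficulties]
# e.g. ~0.92 on the easiest item down to ~0.18 on the hardest

# A model with "anomalously consistent" performance would instead show a
# near-flat accuracy profile across the same difficulties
# (hypothetical numbers for illustration):
flat_curve = [0.70, 0.68, 0.71, 0.69, 0.70]

def difficulty_sensitivity(accs: list) -> float:
    """Spread of accuracy across difficulty levels; a small spread
    signals difficulty-insensitive (non-human-like) scoring."""
    return max(accs) - min(accs)
```

A simple diagnostic, then, is to compare `difficulty_sensitivity(human_curve)` (large) with `difficulty_sensitivity(flat_curve)` (small); the second discrepancy, high variance on same-difficulty questions, would analogously show up as large accuracy spread *within* one difficulty bin.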
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4217