Beyond accuracy: understanding the performance of LLMs on exams designed for humans

Pedro Calais; Gabriel Franco; Themistoklis Nikas; Zilu Tang; Mark Crovella; Wagner Meira Jr.; Evimaria Terzi

Beyond accuracy: understanding the performance of LLMs on exams designed for humans

Pedro Calais, Gabriel Franco, Themistoklis Nikas, Zilu Tang, Mark Crovella, Wagner Meira Jr., Evimaria Terzi

15 May 2024 (modified: 06 Nov 2024)Submitted to NeurIPS 2024EveryoneRevisionsBibTeXCC BY 4.0

Keywords: large language models, model evaluation, psychometrics

TL;DR: We apply traditional psychometrics tools to evaluate the performance of large language models and compare their patterns of correct and incorrect answers against a large dataset of human students doing college-entrance level exams.

Abstract: Many recent studies of LLM performance have focused on the ability of LLMs to achieve outcomes comparable to humans on academic and professional exams. However, it is not clear whether such studies shed light on the extent to which models show reasoning ability, and there is controversy about the significance and implications of such results. We seek to look more deeply into the question of how and whether the performance of LLMs on exams designed for humans reflects true aptitude inherent in LLMs. We do so by making use of the tools of psychometrics which are designed to perform meaningful measurement in test taking. We leverage a unique dataset that captures the detailed performance of over 5M students across 8 college-entrance exams given over a span of two years in Brazil. With respect to the evaluation of LLM abilities, we show that the tools of Item Response Theory (IRT) provide a more informative evaluation of model performance than the usual accuracy metrics employed in previous studies. Digging deeper, we show that the modeling framework of IRT, by explicitly modeling the difficulty levels of questions, allows us to quantitatively distinguish between LLMs that answer questions in “human-like” patterns versus LLMs that do not. We also show how to quantitatively identify cases in which exam results are not reliable measurements of an LLM's ability. Using the tools of IRT we can also identify specific questions that appear to be either much easier, or much harder, for machines than for humans, and we give some reasons for those differences. Overall, our study shows that the conventional focus on accuracy as the primary performance metric for LLM studies does not allow us to deeply understand the true capabilities of LLMs and compare them to that of humans. Thus, we claim that psychometric modeling should play a larger role in the evaluation of LLM capabilities on exams designed for humans.

Supplementary Material: zip

Primary Area: Evaluation (methodology, meta studies, replicability and validity)

Submission Number: 19331

Loading