Large Language Model Benchmarks Do Not Test Reliability

Published: 12 Oct 2024, Last Modified: 17 Dec 2024, SafeGenAI Poster, CC BY 4.0
Keywords: llm, reliability, language model, benchmark, gold, platinum, validated, logic, math, trustworthy
TL;DR: LLM benchmarks are too noisy to test reliability. We clean several benchmarks and find that LLMs still make simple mistakes, showing a gap in benchmarking practices.
Abstract: When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also *reliable*. Many benchmarks have been created to track LLMs' growing capabilities, but there has been no similar focus on measuring their reliability. To understand this landscape, we first investigate how well current benchmarks quantify model reliability. We find that pervasive label errors compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we propose the construction of so-called platinum benchmarks, which are carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures reveals previously unidentified patterns of questions on which frontier models consistently struggle.
Submission Number: 146