Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities

Published: 19 Oct 2024, Last Modified: 19 Oct 2024
Accepted by TMLR
License: CC BY 4.0
Abstract: In this paper we explore the evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as its input parameter an element drawn from a large, well-defined population (e.g., counting elements in a list or multiplying two k-digit numbers). We examine several conditions per task and perform enough trials that statistically significant differences can be detected. This allows us to investigate the sensitivity of task accuracy both to query phrasing and to the input parameter population. We find that seemingly trivial modifications to the task prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query phrasing and list length, but also with list composition (i.e., the thing to be counted) and object frequency (e.g., success when an element accounts for ≈ 50% of a list differs from success when it accounts for ≈ 70%). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, in which experimental observations are improperly generalized beyond what the data support. A consequence appears to be that intuitions formed through interactions with humans are a very unreliable guide to which input modifications should "make no difference" to LLM performance.
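To make concrete what "far larger than can be explained by sampling effects" means, here is a minimal sketch of a pooled two-proportion z-test comparing task accuracy under two conditions. This is a standard textbook test chosen for illustration; the abstract does not say which test the authors use, and the function name and the counts below are hypothetical, not results from the paper.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Illustrative only: not necessarily the test used in the paper.
    """
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    # Under the null hypothesis both conditions share one success rate.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0:
        # All trials succeeded or all failed in both conditions.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: the same task under two prompt phrasings.
    z, p = two_proportion_z_test(450, 500, 400, 500)
    print(f"z = {z:.2f}, p = {p:.2g}")
```

A small p-value here says only that the two tested conditions differ. It does not license a claim about phrasings or input populations that were never sampled, which is the generalization the abstract warns against.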
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yu_Meng1
Submission Number: 2715