Abstract: Large language model (LLM) practitioners commonly notice that outputs
can vary for the same inputs under settings expected to be deterministic.
Yet the questions of how pervasive this is, and with what impact on results,
have not, to our knowledge, been systematically investigated. We investigate
non-determinism in five LLMs configured to be deterministic when applied
to eight common tasks across 10 runs, in both zero-shot and few-shot settings.
tings. We see accuracy variations up to 15% across naturally occurring runs
with a gap of best possible performance to worst possible performance up
to 70%. In fact, none of the LLMs consistently delivers repeatable accuracy
across all tasks, much less identical output strings. Sharing preliminary
results with insiders has revealed that non-determinism perhaps essen-
tial to the efficient use of compute resources via co-mingled data in input
buffers so this issue is not going away anytime soon. To better quantify
our observations, we introduce metrics focused on quantifying determin-
ism, TARr@N for the total agreement rate at N runs over raw output, and
TARa@N for total agreement rate of parsed-out answers. Our code and data
are publicly available at https://github.com/breckbaldwin/llm-stability.
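To make the metrics concrete, below is a minimal sketch of how TARr@N and
TARa@N could be computed, assuming each is the fraction of test items whose
outputs are identical across all N runs; the function name and input layout
are illustrative, not taken from the linked repository.

```python
from typing import List

def tar_at_n(outputs_per_item: List[List[str]]) -> float:
    """Total agreement rate at N runs: the fraction of test items
    for which all N runs produced exactly the same output.

    Pass raw output strings to get TARr@N, or parsed-out answers
    to get TARa@N.
    """
    if not outputs_per_item:
        raise ValueError("need at least one test item")
    fully_agreeing = sum(
        1 for runs in outputs_per_item if len(set(runs)) == 1
    )
    return fully_agreeing / len(outputs_per_item)

# Example: 2 items, N = 3 runs each; only the first item's runs all agree.
print(tar_at_n([["A", "A", "A"], ["A", "B", "A"]]))  # 0.5
```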