TL;DR: We evaluate the extent to which different language models are correlated in how they err.
Abstract: Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ \textit{meaningfully}. We conduct a large-scale empirical evaluation on over 350 LLMs in total, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors: on one leaderboard dataset, models give the same wrong answer 60\% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even across distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring, the latter reflecting theoretical predictions regarding algorithmic monoculture.
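To make the headline metric concrete, here is a minimal sketch (hypothetical, not taken from the paper's released code) of the error-agreement rate the abstract reports: among examples where both models answer incorrectly, the fraction on which they give the same wrong answer.

```python
# Minimal sketch (assumed, not from the linked repo) of the error-agreement
# metric: among examples where BOTH models err, how often do they produce
# the same wrong answer?

def error_agreement(answers_a, answers_b, gold):
    """Fraction of jointly wrong examples on which the two models agree."""
    both_wrong = [
        (a, b)
        for a, b, g in zip(answers_a, answers_b, gold)
        if a != g and b != g
    ]
    if not both_wrong:
        return float("nan")  # undefined if the models never err together
    return sum(a == b for a, b in both_wrong) / len(both_wrong)

# Toy usage: on 4-option multiple choice, two models erring uniformly at
# random would agree on the wrong answer only about 1/3 of the time.
gold      = ["A", "B", "C", "D", "A"]
model_one = ["B", "B", "D", "D", "C"]   # errs on items 1, 3, 5
model_two = ["B", "C", "D", "D", "C"]   # errs on items 1, 2, 3, 5
print(error_agreement(model_one, model_two, gold))  # 1.0 on this toy data
```

Comparing this conditional agreement rate to the random baseline (1/(k-1) for k answer options) is what makes the 60\% figure meaningful.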
Lay Summary: There are many available LLMs. One hope is that these different LLMs offer some diversity. Diversity can be desirable for avoiding systemic failures, such as a single job applicant being screened out of every job. Here, we study whether different LLMs differ in a meaningful way. In particular, we study how correlated LLMs are in how they err. Using three datasets and over 350 LLMs, we find that LLMs agree on the wrong answer far more often than they would at random, and that newer LLMs and LLMs from the same company tend to be more correlated. So even if LLMs differ on the surface, they may be converging under the hood.
Link To Code: https://github.com/nikhgarg/llm_correlated_errors_public
Primary Area: Social Aspects
Keywords: algorithmic monoculture, evaluations, LLMs
Submission Number: 15069