TL;DR: We evaluate the extent to which different language models are correlated in how they err.
Abstract: Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ \textit{meaningfully}. We conduct a large-scale empirical evaluation on over 350 LLMs in total, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors: on one leaderboard dataset, models give the same wrong answer 60\% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even across distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring, the latter reflecting theoretical predictions regarding algorithmic monoculture.
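To make the headline metric concrete, here is a minimal sketch (hypothetical, not taken from the paper's released code) of the error-agreement rate the abstract reports: among examples where both models answer incorrectly, the fraction on which they give the same wrong answer.

```python
# Minimal sketch (assumed, not from the linked repo) of the error-agreement
# metric: among examples where BOTH models err, how often do they produce
# the same wrong answer?

def error_agreement(answers_a, answers_b, gold):
    """Fraction of jointly wrong examples on which the two models agree."""
    both_wrong = [
        (a, b)
        for a, b, g in zip(answers_a, answers_b, gold)
        if a != g and b != g
    ]
    if not both_wrong:
        return float("nan")  # undefined if the models never err together
    return sum(a == b for a, b in both_wrong) / len(both_wrong)

# Toy usage: on 4-option multiple choice, two models erring uniformly at
# random would agree on the wrong answer only about 1/3 of the time.
gold      = ["A", "B", "C", "D", "A"]
model_one = ["B", "B", "D", "D", "C"]   # errs on items 1, 3, 5
model_two = ["B", "C", "D", "D", "C"]   # errs on items 1, 2, 3, 5
print(error_agreement(model_one, model_two, gold))  # 1.0 on this toy data
```

Comparing this conditional agreement rate to the random baseline (1/(k-1) for k answer options) is what makes the 60\% figure meaningful.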
Lay Summary: There are many available LLMs. One hope is that these different LLMs offer some diversity. Diversity can be desirable for avoiding systemic failures, such as a single job applicant being screened out of every job. Here, we study whether different LLMs differ in a meaningful way. In particular, we study how correlated LLMs are in how they err. Using three datasets and over 350 LLMs, we find that LLMs agree on the wrong answer far more often than they would at random, and that newer LLMs and LLMs from the same company tend to be more correlated. So even if LLMs differ on the surface, they may be converging under the hood.
Link To Code: https://github.com/nikhgarg/llm_correlated_errors_public
Primary Area: Social Aspects
Keywords: algorithmic monoculture, evaluations, LLMs
Submission Number: 15069