Keywords: Causality, Language Models, Evaluation
Abstract: State-of-the-art large language models rely on randomization to respond to a prompt. Consequently, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for this randomization. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation while provably requiring fewer samples. However, we further show that, on evaluations based on pairwise comparisons, the two approaches can, surprisingly, lead to different rankings when comparing more than two models. To complement our theoretical results, we conduct experiments with models from the Llama, Mistral, and Qwen families, popular benchmark datasets, and prompts from the LMSYS Chatbot Arena platform.
Submission Number: 27
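The idea of letting different models sample with the same source of randomness can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the Gumbel-max trick as the sampling mechanism, a vocabulary shared by all models, and toy stand-in models. The names `toy_model`, `generate_coupled`, and `VOCAB_SIZE` are hypothetical and used for illustration only.

```python
import math
import random

VOCAB_SIZE = 5  # hypothetical shared vocabulary size (an assumption of this sketch)


def gumbel_noise(rng, size):
    """Draw one Gumbel(0, 1) variate per token in the vocabulary."""
    noise = []
    for _ in range(size):
        # Clamp the uniform draw away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise.append(-math.log(-math.log(u)))
    return noise


def sample_token(logprobs, noise):
    """Gumbel-max trick: argmax_i (log p_i + g_i) is distributed according to p."""
    return max(range(len(logprobs)), key=lambda i: logprobs[i] + noise[i])


def toy_model(bias):
    """Hypothetical stand-in for an LLM: maps a token prefix to next-token log-probs."""
    def next_token_logprobs(prefix):
        weights = [1.0 + bias * ((i + len(prefix)) % VOCAB_SIZE)
                   for i in range(VOCAB_SIZE)]
        total = sum(weights)
        return [math.log(w / total) for w in weights]
    return next_token_logprobs


def generate_coupled(models, steps, seed):
    """Generate one sequence per model, reusing the same noise at every step."""
    rng = random.Random(seed)
    prefixes = [[] for _ in models]
    for _ in range(steps):
        noise = gumbel_noise(rng, VOCAB_SIZE)  # shared across all models
        for prefix, model in zip(prefixes, models):
            prefix.append(sample_token(model(prefix), noise))
    return prefixes


if __name__ == "__main__":
    seq_a, seq_b = generate_coupled([toy_model(0.5), toy_model(2.0)], steps=8, seed=0)
    print(seq_a)
    print(seq_b)
```

In this sketch, each model still samples from its own next-token distribution, so any difference between the two output sequences comes from the models themselves and not from independent random draws. Vanilla generation would instead draw fresh noise for each model.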