Search Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses

ICLR 2026 Conference Submission 14693 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Retrieval-Augmented Generation, RAG, Nugget Evaluation, Search Arena, Automatic LLM Evaluation
TL;DR: A rigorous nugget evaluation framework to analyze data from 5K Search Arena battles provided by LMArena in a fully automatic manner.
Abstract: Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, are used to assess large language model (LLM) output quality, and have recently been extended to retrieval-augmented generation (RAG) systems. Although battles mark progress in evaluation, they have two key limitations for complex information-seeking queries: they are neither explanatory nor diagnostic. On the other hand, nugget-based evaluation, which decomposes long-form answers into atomic facts and highlights the essential parts of an answer, has emerged as a promising strategy for RAG evaluation. In this work, we employ AutoNuggetizer, a nugget-based framework, to analyze ∼5K Search Arena battles from LMArena by automatically generating and assigning nuggets, converting each model response into a quantitative score. Our results show a weighted Cohen’s κ of 0.30 between nugget scores and human preferences. Notably, this result is on par with using an LLM as a judge for automatic evaluation, while substantially reducing the number of preference inversions. Furthermore, we provide in-depth analyses of preference inversions, nugget quality, and shared-blindness effects. All our code and datasets will be released publicly upon paper acceptance.
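The abstract refers to two quantitative components: a per-response nugget score and a weighted Cohen's κ between nugget-derived outcomes and human preferences. The sketch below illustrates one plausible way to compute both; the credit values, tie margin, outcome encoding, and linear weighting are illustrative assumptions, not the paper's exact AutoNuggetizer scoring.

```python
# Minimal sketch (assumed conventions, not the authors' implementation):
#   1) a nugget-based score for a single model response, and
#   2) weighted Cohen's kappa between nugget-derived battle outcomes
#      and human preferences.
from sklearn.metrics import cohen_kappa_score

# Assumed credit scheme: each nugget is judged as fully supported,
# partially supported, or not supported by the response.
CREDIT = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}

def nugget_score(assignments: list[str]) -> float:
    """Fraction of nugget credit earned by a response (assumed scoring rule)."""
    if not assignments:
        return 0.0
    return sum(CREDIT[a] for a in assignments) / len(assignments)

def battle_outcome(score_a: float, score_b: float, tie_margin: float = 0.05) -> int:
    """Map a pair of nugget scores to a preference label:
    0 = model A wins, 1 = tie, 2 = model B wins.
    The tie margin is an illustrative threshold, not a value from the paper."""
    if abs(score_a - score_b) <= tie_margin:
        return 1
    return 0 if score_a > score_b else 2

# Hypothetical example: agreement between human votes and nugget-derived
# outcomes over a handful of battles, using linear weighting.
human_votes = [0, 2, 1, 0, 2]
nugget_votes = [0, 2, 2, 0, 1]
kappa = cohen_kappa_score(human_votes, nugget_votes, weights="linear")
print(f"weighted Cohen's kappa = {kappa:.2f}")
```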
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 14693