Search Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses

ICLR 2026 Conference Submission 14693 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Retrieval-Augmented Generation, RAG, Nugget Evaluation, Search Arena, Automatic LLM Evaluation
TL;DR: A rigorous nugget evaluation framework to analyze data from 5K Search Arena battles provided by LMArena in a fully automatic manner.
Abstract: Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, are used to assess large language model (LLM) output quality, and have recently been extended to retrieval-augmented generation (RAG) systems. Although battles mark progress in evaluation, they have two key limitations for complex information-seeking queries: they are neither explanatory nor diagnostic. On the other hand, nugget-based evaluation, which decomposes long-form answers into atomic facts and highlights the essential parts of an answer, has emerged as a promising strategy for RAG evaluation. In this work, we employ AutoNuggetizer, a nugget-based framework, to analyze ∼5K Search Arena battles from LMArena by automatically generating and assigning nuggets, converting each model response into a quantitative score. Our results show a weighted Cohen’s κ of 0.30 between nugget scores and human preferences. Notably, this result is on par with using an LLM as a judge for automatic evaluation, while substantially reducing the number of preference inversions. Furthermore, we provide in-depth analyses of preference inversions, nugget quality, and shared-blindness effects. All our code and datasets will be released publicly upon paper acceptance.
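The abstract refers to two quantitative components: a per-response nugget score and a weighted Cohen's κ between nugget-derived outcomes and human preferences. The sketch below illustrates one plausible way to compute both; the credit values, tie margin, outcome encoding, and linear weighting are illustrative assumptions, not the paper's exact AutoNuggetizer scoring.

```python
# Minimal sketch (assumed conventions, not the authors' implementation):
#   1) a nugget-based score for a single model response, and
#   2) weighted Cohen's kappa between nugget-derived battle outcomes
#      and human preferences.
from sklearn.metrics import cohen_kappa_score

# Assumed credit scheme: each nugget is judged as fully supported,
# partially supported, or not supported by the response.
CREDIT = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}

def nugget_score(assignments: list[str]) -> float:
    """Fraction of nugget credit earned by a response (assumed scoring rule)."""
    if not assignments:
        return 0.0
    return sum(CREDIT[a] for a in assignments) / len(assignments)

def battle_outcome(score_a: float, score_b: float, tie_margin: float = 0.05) -> int:
    """Map a pair of nugget scores to a preference label:
    0 = model A wins, 1 = tie, 2 = model B wins.
    The tie margin is an illustrative threshold, not a value from the paper."""
    if abs(score_a - score_b) <= tie_margin:
        return 1
    return 0 if score_a > score_b else 2

# Hypothetical example: agreement between human votes and nugget-derived
# outcomes over a handful of battles, using linear weighting.
human_votes = [0, 2, 1, 0, 2]
nugget_votes = [0, 2, 2, 0, 1]
kappa = cohen_kappa_score(human_votes, nugget_votes, weights="linear")
print(f"weighted Cohen's kappa = {kappa:.2f}")
```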
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 14693