Search Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses

ACL ARR 2026 January Submission 7662 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Large language model, Evaluation, Nuggetization
Abstract: Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, are used to assess large language model (LLM) output quality, and have recently been extended to retrieval-augmented generation (RAG) systems. Although battles mark progress in evaluation, they have two key limitations for complex information-seeking queries: they are neither explanatory nor diagnostic. Nugget-based evaluation, on the other hand, which decomposes long-form answers into atomic facts and highlights the essential parts of an answer, has emerged as a promising strategy for RAG evaluation. In this work, we employ Autonuggetizer, a nugget-based framework, to analyze ~5K Search Arena battles from LMArena by automatically generating and assigning nuggets, converting each model response into a quantitative score. We observe strong alignment between nugget-based Elo rankings and human preferences, with Kendall's tau of 0.71 and Spearman's rho of 0.88, exceeding the corresponding alignment achieved by LLM-as-a-judge evaluation (0.64 and 0.79, respectively), while substantially reducing the number of preference inversions. Furthermore, we provide in-depth analyses of inversions, nugget quality, and shared-blindness effects. All our code and datasets will be released publicly upon paper acceptance.
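The abstract describes two computational ingredients: converting nugget assignments into a per-response score, and measuring rank agreement between nugget-based and human-preference Elo rankings via Kendall's tau and Spearman's rho. The sketch below illustrates both under stated assumptions; the function names, nugget-importance weights, and support labels are illustrative choices, not the paper's exact formulation.

```python
# Hypothetical sketch, not the paper's exact scoring formula.
from scipy.stats import kendalltau, spearmanr

# Credit given per nugget depending on how well the response supports it
# (labels assumed; the paper may use a different label set).
SUPPORT_CREDIT = {"support": 1.0, "partial_support": 0.5, "not_support": 0.0}


def nugget_score(assignments):
    """Weighted fraction of nuggets a single response supports.

    `assignments` is a list of (importance, support_label) pairs, e.g.
    [("vital", "support"), ("okay", "not_support")]. Vital nuggets count
    fully and okay nuggets at half weight (an assumed weighting).
    """
    if not assignments:
        return 0.0
    weights = [1.0 if imp == "vital" else 0.5 for imp, _ in assignments]
    credits = [SUPPORT_CREDIT[label] for _, label in assignments]
    return sum(w * c for w, c in zip(weights, credits)) / sum(weights)


def rank_agreement(nugget_elo, human_elo):
    """Kendall's tau and Spearman's rho between two Elo tables keyed by model."""
    models = sorted(nugget_elo)
    a = [nugget_elo[m] for m in models]
    b = [human_elo[m] for m in models]
    tau, _ = kendalltau(a, b)
    rho, _ = spearmanr(a, b)
    return tau, rho
```

For example, a response assigned [("vital", "support"), ("okay", "partial_support")] would score (1.0 + 0.25) / 1.5 ≈ 0.83 under these assumed weights; aggregating such scores per battle yields the rankings whose agreement with human preferences is reported in the abstract.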
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Information Extraction and Retrieval, Resources and Evaluation
Languages Studied: English, German, Chinese, Portuguese, Russian, French
Submission Number: 7662