Conversational Gold: Evaluating Personalized Conversational
Search System using Gold Nuggets
Zahra Abbasiantaeb∗
University of Amsterdam
Amsterdam, The Netherlands
z.abbasiantaeb@uva.nl
Simon Lupart∗
University of Amsterdam
Amsterdam, The Netherlands
s.c.lupart@uva.nl
Leif Azzopardi
University of Strathclyde
Glasgow, Scotland, UK
leif.azzopardi@strath.ac.uk
Jeffery Dalton
University of Edinburgh
Edinburgh, Scotland, UK
jeff.dalton@ed.ac.uk
Mohammad Aliannejadi
University of Amsterdam
Amsterdam, The Netherlands
m.aliannejadi@uva.nl
Abstract
The rise of personalized conversational search systems has been
driven by advancements in Large Language Models (LLMs), en-
abling these systems to retrieve and generate answers for complex
information needs. However, the automatic evaluation of responses
generated by Retrieval Augmented Generation (RAG) systems remains
an understudied challenge. In this paper, we introduce a new
resource for assessing the retrieval effectiveness and the relevance of
responses generated by RAG systems, using a nugget-based evaluation
framework. Built upon the foundation of TREC iKAT 2023, our
dataset extends to the TREC iKAT 2024 collection, which includes 17
conversations and 20,575 passage relevance assessments, together
with 2,279 extracted gold nuggets and 62 manually written gold
answers from NIST assessors. While maintaining the core structure
of its predecessor, this new collection enables a deeper exploration
of generation tasks in conversational settings. Key improvements
in iKAT 2024 include: (1) “gold nuggets” — concise, essential pieces
of information extracted from relevant passages of the collection
— which serve as a foundation for automatic response evaluation;
(2) manually written answers to provide a gold standard for re-
sponse evaluation; (3) unanswerable questions to evaluate model
hallucination; (4) expanded user personas, providing richer contex-
tual grounding; and (5) a transition from Personal Text Knowledge
Base (PTKB) ranking to PTKB classification and selection. Built
on this resource, we provide a framework for evaluating long-form answer
generation, involving nugget extraction and nugget
matching, linked to retrieval. This establishes a solid resource for
advancing research in personalized conversational search and long-
form answer generation. Our resources are publicly available at
https://github.com/irlabamsterdam/CONE-RAG.
∗Equal contributions.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
Conference acronym ’XX, Woodstock, NY
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/2018/06
https://doi.org/XXXXXXX.XXXXXXX
CCS Concepts
• Information systems →Test collections.
Keywords
Conversational Information Seeking, Retrieval-Augmented Genera-
tion, Evaluation, Information Nuggets, Test Collection
ACM Reference Format:
Zahra Abbasiantaeb, Simon Lupart, Leif Azzopardi, Jeffery Dalton, and Mo-
hammad Aliannejadi. 2018. Conversational Gold: Evaluating Personalized
Conversational Search System using Gold Nuggets. In Proceedings of Make
sure to enter the correct conference title from your rights confirmation email
(Conference acronym ’XX). ACM, New York, NY, USA, 11 pages. https:
//doi.org/XXXXXXX.XXXXXXX
1
Introduction
Conversational information seeking provides a natural and intu-
itive way for users to interact and discover relevant information
through dialogue with an agent [9, 32]. Large Language Mod-
els (LLMs) [15] have taken us one step further to having access
to Conversational Search Agents (CSAs) (e.g., ChatGPT, BingChat,
Gemini, and BlenderBot). Moreover, wide-scale research and devel-
opment in this area is more possible than ever, given the relative
accessibility of the technology. However, despite these advances and
the numerous resources available for evaluating LLMs over a
variety of tasks, CSAs present their own unique evaluation
challenges [3]. The complexity of CSAs stems from their interactive
nature [10]. Conversations can evolve in various ways with
varied discourse [29] and mixed initiatives [7]. On top of that,
conversational interactions are expected to be more personalized and
tailored to the user than standard search systems [44].
LLMs have greatly influenced how IR research is shaped [8].
Evaluation has been one of the main themes in the latest IR strategic
meetings [8], highlighting the impact of LLMs on the way we
consume and evaluate information. This has led to increased community
engagement in LLM-based evaluation [4, 17, 35, 39] and
to ongoing workshops such as LLM4Eval [33, 34] and Eval4RAG.1
Current research emphasizes the importance of evaluating Retrieval
Augmented Generation (RAG) systems and the challenges involved [35, 37].
1https://eval4rag.github.io/
arXiv:2503.09902v1  [cs.IR]  12 Mar 2025

Conference acronym ’XX, June 03–05, 2018, Woodstock, NY
Abbasiantaeb et al
TREC Conversational Assistance Track (CAsT) [29] ran for four
years, aiming to provide a strong evaluation framework for
conversational information-seeking tasks. TREC Interactive Knowledge
Assistance Track (iKAT) 2023 [5] took a step further, providing
complex decisional tasks that require multi-turn collaborative
human-agent interactions. TREC iKAT 2023 also incorporated a personal
knowledge graph in each dialogue, representing the system's noisy
knowledge about the user. Multiple TREC tracks in 2024,
such as TREC RAG,2 iKAT,3 and NeuCLIR,4 focused on providing
reusable RAG collections, highlighting the importance and
significance of this research direction.
RAG evaluation faces several challenges as the nature of the task
is complex and multi-step, which has resulted in a multitude of
possible configurations [4]. Given that the generated response of
a RAG system is a mix of the retrieved documents and the LLM’s
internal knowledge, existing surface-based QA metrics are not ideal
for evaluation [35]. Besides, existing RAG collections offer more
complex information needs, as opposed to ad-hoc retrieval and
QA [37], calling for measuring the completeness of the generated
responses. Existing research builds [25, 31, 37, 38] on the ideas of
nugget-based evaluation for summarization [18], where the gener-
ated response is broken into information nuggets and compared
against a set of gold nuggets for the information need. However,
most prior work relies on LLM-generated nuggets, without having
studied the effectiveness of LLM-based nugget extraction, poten-
tially leading to unforeseen pitfalls and biases.
To address these limitations, in this work we present experiments on,
and extend, the resources we created for TREC iKAT 2024. In
particular, we present a set of complex multi-turn conversational
topics, assessed for document relevance by NIST assessors. We
also collect human-extracted information nuggets from the relevant
documents, along with a human-crafted gold response (given
the nuggets). We then conduct extensive experiments on the
effectiveness of LLM-extracted nuggets in comparison with human
nuggets. Furthermore, we conduct a crowdsourced human nugget
matching study to assess both the nugget matching capabilities of
different LLMs and end-to-end RAG evaluation. Based on our
experiments and collected data, we propose a novel, extensible RAG
evaluation framework, called CONE-RAG, which extracts and matches
the nuggets of a RAG response and measures multiple
nugget- and surface-based metrics. We publicly release the
conversational topics, document relevance assessments, human nuggets,
gold responses, crowdsourced nugget matching labels, and CONE-RAG.
We believe that our experiments and the provided resources will
foster research in conversational RAG and nugget-based evaluation
by providing useful insights into the problem.
2
Related Work
The development and evaluation of interactive CSAs is a topic of
interest in Information Retrieval (IR). Existing research
[12–14, 27, 29, 41–43] has sought to facilitate the development of
conversational search systems by proposing standard test collections. The
TREC Interactive Track (1998–2002) [27] and the TREC Dynamic
Domain Track (2015–2017) [41–43] provided resources for passage
retrieval across multiple rounds of feedback, focusing on iterative
refinement rather than conversational interactions. The TREC
Conversational Assistance Track (CAsT) [12–14, 27, 29, 41–43] was
one of the first attempts to provide resources for the conversational
search task. The track ran for four years, resulting in the TREC CAsT
2019, 2020, 2021, and 2022 test collections. Over these four years,
the track evolved by (1) making the conversations more complex, longer,
and more dependent on previous user–system interactions, (2)
adding mixed-initiative interactions (clarification, feedback,
elicitation, etc.), and (3) making the conversations multi-path, based
on different trajectories. These efforts resulted in more realistic and
challenging conversational search scenarios.
2https://trec-rag.github.io/
3https://www.trecikat.com/
4https://neuclir.github.io/
The TREC Interactive Knowledge Assistance Track (iKAT) [5]
evolved TREC CAsT into a new track by making the conversations
personalized. In personalized search, the same user query should
receive different responses for users with different personas.
TREC iKAT 2023 enriched the conversations with user personas and
added more complex information needs. The persona of the user is
provided as a set of natural language sentences and is static during
the conversation. The system needs to reason over the context, the
persona, and different sources of information to respond to the complex
information needs. The MTRAG collection focuses on the evaluation
of multi-turn RAG systems by providing manually collected
and simulated topics. However, the dataset lacks a benchmark for
passage retrieval. In this work, we propose the TREC iKAT 2024
collection, which includes resources for passage retrieval and
response generation in personalized search.
Despite the advances in RAG systems and response generation
brought by LLMs, the evaluation of RAG systems remains a
challenge. Reference-based metrics such
as ROUGE [23] and BLEU [30] are commonly used to evaluate RAG
systems by measuring the overlap between generated responses and
reference texts, assessing lexical similarity and relevance in terms of
precision, recall, and n-gram matching. One line of research focuses
on using LLMs for the evaluation of RAG systems [16, 21, 36, 45, 46].
The BERGEN [35] tool evaluates RAG systems by computing
reference-based and LLM-based metrics. Another line of
research [31, 37] pursues nugget-based evaluation of
RAG systems. ICAT is an evaluation framework designed to assess
the coverage of diverse factual information in long-form text
generation. It decomposes the generated text into atomic claims,
verifies each claim by retrieving information from a reliable
knowledge source, and evaluates the alignment between these claims and
the key aspects expected in the output. Nugget-based evaluation
was first proposed in the TREC Question Answering Track in
2003 [40]. The TREC 2024 Retrieval Augmented Generation (RAG)
Track [31] automates nugget-based evaluation by using
LLMs to extract and assign the nuggets. In this
work, we provide both manually collected and LLM-based nuggets
of information for the TREC iKAT 2024 collection. In addition, we
propose an automatic pipeline for extracting nuggets of
information from the input text (either a response or a passage),
matching them with the set of gold nuggets, and measuring the
nugget recall and precision metrics for conversational search.

3
Resources
The goal of TREC iKAT is to advance the evaluation of personalized
CSAs. To this end, the track breaks the task of a CSA into three
distinct components: (1) passage retrieval, (2) classification
of the user persona, and (3) response generation. In this work, we
provide resources for the different components of a CSA by extending
the existing resources from [5] and [6], and by providing an
automatic framework, called CONE-RAG, for evaluating the quality of
responses generated by RAG systems. In addition, we provide
LLM-based resources for the different components of the personalized CSA.
We divide our resources into four groups, namely,
(1) human-annotated resources collected from both NIST assessors
and crowdsourcing, (2) resources collected with the aid of LLMs, (3)
our new RAG evaluation pipeline (CONE-RAG), and (4) participants'
runs. We describe these resources in the following.
3.1
Human-Annotated Resources
Personalized conversational topics. The TREC iKAT 2024
collection includes 17 topics, and each topic is associated with one
distinct PTKB (persona). Unlike the iKAT 2023 collection, only one
dialogue is developed for each topic, so that the collection covers a
larger and more diverse set of topics. In total, the dataset includes
218 user–system turns, with an average of 12.82
user–system turns per dialogue. Each user–system turn of the
conversation includes (1) the user utterance, (2) the resolved user
utterance, (3) a canonical response grounded in passages from the
collection, (4) response provenance, and (5) PTKB provenance. The
topics include longer and more complete responses than the TREC iKAT
2023 collection: the average length of canonical responses is 95.29
words, compared to 77.26 words for the iKAT 2023 collection.
For topic development, we mainly followed the same procedure
and guidelines used for the development of the TREC iKAT 2023
collection (see [5]) with some modifications. During the topic de-
velopment, we discarded the topics that were deemed too easy or
too difficult, based on a preliminary GPT4 relevance assessment.
We used the automatic relevance judgment model proposed by [2]
for relevance assessment of personalized CSAs.
Collection. We use the existing collection from TREC iKAT 2023,
which is a subset of ClueWeb22-B [28] and includes 116,838,987
passages. We use the same segmentation code used in iKAT 2023
for segmenting the documents. The iKAT 2023 dataset included
a Pyserini [24] index for the collection. We extend the existing
collection from iKAT 2023 by publishing a new learned sparse
retrieval index. This index uses the CoCondenser SelfDistil SPLADE++
checkpoint from HuggingFace5 and is built as a numba index, as in the
original SPLADE GitHub repository.
Passage relevance assessment. The relevance of the passages
is assessed using the same scale of scores (0–4) used for TREC
iKAT 2023 and CAsT collections. The NIST assessors judged the
relevance of passages. We selected a subset of 116 turns from 218
turns for relevance assessment by discarding the very general and
clarification turns. We tried to keep the turns with relevant PTKB
statements for relevance assessment.
5naver/splade-cocondenser-selfdistil
Adaptive pooling. As existing research [2] has shown reusability
issues for the TREC iKAT 2023 dataset, we tried to mitigate this
challenge by using an adaptive pooling approach rather than the
existing static pooling approach. Adaptive pooling enabled us
to assess up to the top 30 passages, leading to considerably more
relevant passages. To do so, we first leveraged automated GPT4o
relevance assessment. Existing research [2, 17] shows that LLMs
tend to be more forgiving than human annotators (i.e., the average
relevance scores are higher) while exhibiting a relatively low false
negative rate. Based on these findings, we filtered the passages
in the top 30 pool and asked the NIST assessors to assess only the
passages that were deemed relevant by GPT4o. To avoid reinforcing
GPT4 biases and the LLM evaluation circularity problem, we included
all of the top 5 passages in the final assessment pool, and applied
the LLM filtering only to the passages ranked between the top 5 and
the top 30. As a result, the final assessment pool contains 20,575
passages, all judged by the NIST assessors.
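The pooling rule described above can be sketched as follows. This is a minimal illustration rather than the track's actual code: the function name `build_assessment_pool`, the `llm_relevance` mapping, and the threshold (an LLM score of at least 1 counts as "deemed relevant") are assumptions made for the example.

```python
from typing import Dict, List, Set


def build_assessment_pool(
    runs: List[List[str]],
    llm_relevance: Dict[str, int],
    always_keep_depth: int = 5,
    pool_depth: int = 30,
    min_llm_score: int = 1,  # assumption: LLM score >= 1 means "deemed relevant"
) -> Set[str]:
    """Return the passage ids sent to human assessors for one turn.

    `runs` holds one ranked list of passage ids per participating run and
    `llm_relevance` maps a passage id to its LLM-assigned score (0-4).
    """
    pool: Set[str] = set()
    for ranking in runs:
        # Ranks 1..5 are always assessed, regardless of the LLM label.
        pool.update(ranking[:always_keep_depth])
        # Ranks 6..30 are assessed only if the LLM judged them relevant.
        # Passages without an LLM label are treated as score 0 here.
        for pid in ranking[always_keep_depth:pool_depth]:
            if llm_relevance.get(pid, 0) >= min_llm_score:
                pool.add(pid)
    return pool
```

Keeping the top 5 unfiltered means that the highest-ranked passages of every run receive human judgments even when the LLM filter would have rejected them.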
Gold response. After assessing passage relevance, we
asked the NIST assessors to write a comprehensive gold response for
the user utterances using the information from relevant passages.
The NIST assessors provided a gold response for 62 turns, selected
by the organizers based on their complexity. Each gold response is
written using a different number of relevant passages; on average,
21.6 passages are used for writing a gold response. The
average length of the gold responses is approximately 104 words.
Nuggets of information. The nuggets of information are extracted
from relevant passages by NIST assessors. These nuggets serve as
a resource for evaluating the responses generated by RAG
systems. Unlike existing approaches [31, 37], which use
a phrasal expression for the information nuggets, each nugget
extracted by the NIST assessors is a continuous span of a passage. In
total, 2,279 nuggets of information are extracted from 79 turns. More
statistics about our collection of nuggets are provided in Table 2. As
these nuggets are extracted from different relevant passages, some of
them may carry duplicate information. For example,
the two nuggets "Snake plants can survive with
infrequent watering" and "Snake plants (Sansevieria) are highly
drought-tolerant" are extracted from different passages, but they
carry the same information and are considered duplicates. We
remove the duplicated nuggets to create a smaller set of nuggets;
after de-duplication, 1,201 nuggets remain. We
release both the original nuggets annotated by NIST assessors and
the de-duplicated version.
Nugget entailment labels. Our goal is to benchmark nugget
entailment and matching. In particular, given a generated response
and an extracted information nugget, the task is to assess whether the
information nugget is entailed by the generated response, i.e., whether
there is a match. See Section 5.2 for more details. We release a total
of 1,356 nugget entailment labels, collected via crowdsourcing on Amazon
Mechanical Turk (MTurk). This resource will enable researchers
to develop and evaluate entailment models for nugget-based
RAG evaluation.

PTKB statement relevance assessment. The relevance of each
PTKB statement to each conversational turn is assessed by the
organizers during topic development, as well as by the NIST assessors.
The PTKB statements are classified as relevant or irrelevant. The
organizers assessed the relevance of PTKB statements for all 218 turns
in the dataset, whereas the NIST assessors judged the relevance
of PTKB statements only on the subset of turns selected for
passage relevance assessment. We release both sets of assessments.
3.2
Automatic Resources
LLM-based relevance assessment pool. We create a pool of
32,999 query–passage pairs by selecting the top 30 passages
returned by participating runs. We use the GPT4o model and the
code released by [2] to assess the relevance of the passages in the pool.
The pool has 17,130, 8,771, 4,393, 1,993, and 712 passages with
relevance scores of 0, 1, 2, 3, and 4, respectively. We release this
automatically judged pool as a resource.
LLM-extracted passage nuggets. We use our nugget extraction
model described in Section 4.1 to extract the nuggets of information
from relevant passages in the assessment pool, i.e., repeating what
the NIST assessors did in extracting nuggets. Our model extracted
a total of 6,680 nuggets for 79 turns, while the human annotators
extracted 2,279 nuggets from the same set of passages. For more
statistics on the LLM-extracted nuggets, see Table 2.
LLM-extracted response nuggets. After extracting the passage
nuggets in the previous step, we use the same model to extract
nuggets from generated responses (submitted by iKAT participants)
and release them as a resource. This resource includes the
LLM-extracted nuggets for 26 different responses over 79
conversational turns.
3.3
RAG Evaluation Pipeline
We release the code for our RAG evaluation pipeline (CONE-RAG)
as a resource. Researchers can use CONE-RAG to evaluate the
quality of generated responses. The released pipeline takes the
generated responses for each user utterance as input in a JSON file
and reports the following metrics:
(1) Total nugget precision and recall based on four different sets
of gold nuggets namely, human nuggets, de-duplicated human
nuggets, LLM nuggets, and de-duplicated LLM nuggets.
(2) Nugget precision and recall over different turns from the dataset.
(3) Rank of the submitted run compared to the baseline and partic-
ipants’ runs.
(4) List of nuggets extracted from the input submission, each with a
label indicating whether it is matched with a gold nugget or not.
3.4
Participant Resources
As an additional resource, we provide several submission runs,
together with baselines, for a total of 41 runs from 8 teams (including
the organizers). The runs belong to three different categories:
automatic (28 runs), manual (10 runs), and generation-only
(3 runs). In the manual runs, the retrieval and response generation
models use one or both of the resolved_utterance and the relevant
PTKB statements provided in the dataset. The generation-only runs
only provide the output of the response generation and use the
provided ranking list of documents for response generation. 25 of
the runs contain both retrieval and generation, and 16 contain only
retrieval.6

Table 1: Statistics of Test Retrieval Data
Topics                                   17
Turns                                    218
PTKB statements                          288
Assessed topics                          14
Assessed turns                           116
Avg. dialogue length                     12.82
Avg. response length                     95.29
Avg. PTKB length                         16.94
Passages assessed                        20,575
  Fails to meet (0)                      10,680
  Slightly meets (1)                     4,246
  Moderately meets (2)                   4,325
  Highly meets (3)                       1,199
  Fully meets (4)                        125
PTKB turns assessed by NIST              114
PTKB assessments by NIST                 1,917
  Relevant (1)                           201
PTKB turns assessed by the organizers    218
PTKB assessments by the organizers       3,660
  Relevant (1)                           175

Table 2: Statistics of Answer Generation Data
Assessed turns                                   79
Human
  Number of nuggets                              2,279
  Average number of nuggets per turn             28.84
  Average length of nuggets (words)              32.36
  Number of nuggets after removing duplicates    1,201
  Human-generated gold answers                   62
  Average length of gold answers                 21.6
LLM
  Number of nuggets                              6,680
  Average number of nuggets per turn             84.55
  Average length of nuggets (words)              13.66
  Number of nuggets after removing duplicates    3,760
Retrieval. For retrieval, most runs used a multi-step pipeline
consisting of the following: (1) PTKB statement relevance prediction; (2)
conversational rewriting (most incorporating the previous canonical
responses as well as the predicted relevant PTKB statements) and
conversational query expansion; (3) retrieval using traditional
lexical or neural IR models; and (4) multi-stage passage re-ranking
with neural language models.
Response generation. For generation, the runs mostly relied on a
RAG pipeline, using retrieved passages from the previous step with
the conversation history or rewrite. Then, a diverse set of LLMs
were used: Llama 8B and 70B, GPT4, GPT4o, and Gemini-1.5-flash.
6A detailed description of the runs is provided in the overview of TREC iKAT 2024 [6].

Table 3: The prompt designed for nugget extraction model.
# Instruction: I will give you a user query and a text to the user
query. You should extract the nuggets of information related to the
user query from the given text. The nuggets should be an exact
copy of a span of text from the text. Please extract the nuggets and
write each nugget in one line. If there is no nugget of information
in the given text, please only say "No nugget".
# User query: {𝑞𝑟}
# Text: {𝑡}
(Please copy exact spans from the text as nuggets)
# Nuggets:
4
RAG Evaluation Pipeline Overview
We propose CONE-RAG, a nugget-based evaluation pipeline for as-
sessment of the generated responses. The pipeline consists of two
components namely nugget extraction and matching. The nugget
extraction component extracts the nuggets of the information from
the input response. The matching component has two approaches
to match either the response or the nuggets of the response to the
gold nuggets to measure the nugget recall and precision.
4.1
Nuggets Extraction
We use zero-shot prompting of an LLM to extract the nuggets
of information from the input text given the user question. The
generated nuggets must be spans of the input text. The prompt
we use for nugget extraction is shown in Table 3. The component
extracts a set of nuggets $\mathcal{N}_P$ from the input text $t$.
$\mathcal{N}_P$ is a set of spans from the input text $t$, where each
span represents one nugget and is denoted $n_p$. The input text $t$
can be a response ($R$) or a passage ($P$). We use the resolved_utterance
($q_r$) as the user query in the prompt, as it is the self-contained query
of the user, containing both the relevant statements from the PTKB and the
context of the conversation. The nugget extraction function (called
$\mathit{Nuggetizer}$) is defined as:
$$\mathcal{N}_P = \mathit{Nuggetizer}(t, q_r) \tag{1}$$
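Equation (1) can be sketched as a thin wrapper around an LLM call. This is an illustrative sketch, not the released implementation: `llm` is a hypothetical callable that maps a prompt string to the model's text output, and the post-hoc check that each returned line is an exact span of the input text is an assumption added here to enforce the span constraint.

```python
from typing import Callable, List

# Prompt text copied from Table 3; {q_r} and {t} are filled in per call.
NUGGET_PROMPT = (
    "# Instruction: I will give you a user query and a text to the user "
    "query. You should extract the nuggets of information related to the "
    "user query from the given text. The nuggets should be an exact copy "
    "of a span of text from the text. Please extract the nuggets and write "
    "each nugget in one line. If there is no nugget of information in the "
    'given text, please only say "No nugget".\n'
    "# User query: {q_r}\n"
    "# Text: {t}\n"
    "(Please copy exact spans from the text as nuggets)\n"
    "# Nuggets:"
)


def nuggetizer(t: str, q_r: str, llm: Callable[[str], str]) -> List[str]:
    """Return the nuggets extracted from text `t` for the resolved query `q_r`."""
    output = llm(NUGGET_PROMPT.format(q_r=q_r, t=t)).strip()
    if output.lower().rstrip(".") == "no nugget":
        return []
    nuggets: List[str] = []
    for line in output.splitlines():
        nugget = line.strip()
        # Assumption: drop lines that are not exact spans of the input text,
        # and drop repeated lines.
        if nugget and nugget in t and nugget not in nuggets:
            nuggets.append(nugget)
    return nuggets
```

Passing the model as a callable keeps the sketch independent of any particular LLM client library.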
4.2
Nugget Matching
We employ two approaches for matching the input response with
the gold nuggets: (1) by extracting nuggets from the input response,
$\mathcal{N}_P$, and matching individual nuggets with gold nuggets $\mathcal{N}_G$; and
(2) by matching the input response $R$ directly to the gold nuggets
$\mathcal{N}_G$, without extracting nuggets from the input response.
Table 4: The prompt designed for NtR matching.
# Instruction: I will provide you with a response and a gold
information piece. Your task is to determine whether the response
captures this piece of information or not.
# Gold Information: {𝑛𝑔}
# Response: {𝑅}
# Please answer the following:
Does the Response capture the Gold Information? Only respond
with “yes” or “no” without further explanation.
# Answer (yes/no):

Nugget to Nugget (NtN). This approach assesses whether an
extracted nugget entails a gold nugget. Given a set of extracted
nuggets $\mathcal{N}_P$ and a set of gold nuggets $\mathcal{N}_G$, we compute entailment
scores between all nugget pairs using the following function:
$$s = \mathrm{NtN}(n_p, n_g) \tag{2}$$
where $n_p$ and $n_g$ are a single extracted nugget and a single gold
nugget, respectively; $s = 1$ if the extracted nugget $n_p$ entails the
gold nugget $n_g$, and $s = 0$ otherwise. The matching process consists
of iterating over all $(n_p, n_g)$ pairs of extracted nuggets and gold
nuggets to determine coverage. We use a Natural Language
Inference (NLI) model to implement the NtN function. This allows us to
determine the subset of covered gold nuggets $\mathcal{N}'_G \subseteq \mathcal{N}_G$ and the
subset of input response nuggets that cover them, $\mathcal{N}'_P \subseteq \mathcal{N}_P$. As a
result, we can compute both nugget recall and precision.
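The NtN matching loop can be sketched as below. The sketch assumes the usual definitions, which this section does not spell out: precision as $|\mathcal{N}'_P| / |\mathcal{N}_P|$ and recall as $|\mathcal{N}'_G| / |\mathcal{N}_G|$. The `entails` argument is a hypothetical stand-in for the NLI model; any function returning a truthy value when its first argument entails the second can be plugged in.

```python
from typing import Callable, List, Set, Tuple


def ntn_match(
    extracted: List[str],
    gold: List[str],
    entails: Callable[[str, str], bool],
) -> Tuple[Set[int], Set[int]]:
    """Return (indices of matched extracted nuggets, indices of covered gold nuggets)."""
    matched_extracted: Set[int] = set()
    covered_gold: Set[int] = set()
    # Iterate over all (n_p, n_g) pairs, as in Equation (2).
    for i, n_p in enumerate(extracted):
        for j, n_g in enumerate(gold):
            if entails(n_p, n_g):
                matched_extracted.add(i)
                covered_gold.add(j)
    return matched_extracted, covered_gold


def nugget_precision_recall(
    extracted: List[str],
    gold: List[str],
    entails: Callable[[str, str], bool],
) -> Tuple[float, float]:
    """Assumed definitions: precision = |N'_P| / |N_P|, recall = |N'_G| / |N_G|."""
    matched, covered = ntn_match(extracted, gold, entails)
    precision = len(matched) / len(extracted) if extracted else 0.0
    recall = len(covered) / len(gold) if gold else 0.0
    return precision, recall
```

The pairwise loop costs one entailment call per (extracted, gold) pair, so batching the NLI model's inputs would matter for large nugget sets.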
Nugget to Response (NtR). In this approach, we directly evaluate
whether the generated response entails a gold nugget. Instead of
comparing individual nuggets, the matching model is prompted
with a response $R$ and a gold nugget $n_g$, and predicts a binary
outcome. This method enables an assessment of whether a response
sufficiently covers the gold nuggets without requiring the extraction
of nuggets from the input response. We use the prompt shown in
Table 4 for matching the input response $R$ with the gold nugget $n_g$.
Note that we prompt the LLM with each gold nugget individually
to break the task down into small units.
$$s = \mathrm{NtR}(R, n_g) \tag{3}$$
We use zero-shot prompting of an LLM to implement the NtR function.
This allows us to determine the subset of covered gold nuggets
$\mathcal{N}'_G \subseteq \mathcal{N}_G$ as the set of gold nuggets for which $s = 1$. With this
approach, we can compute only the nugget recall.
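A sketch of Equation (3) and the resulting recall follows. As before, `llm` is a hypothetical callable from prompt to text, and treating any answer that starts with "yes" (case-insensitive) as a match is an assumption about output parsing, not a documented detail of the pipeline.

```python
from typing import Callable, List

# Prompt text copied from Table 4; {n_g} and {R} are filled in per call.
NTR_PROMPT = (
    "# Instruction: I will provide you with a response and a gold "
    "information piece. Your task is to determine whether the response "
    "captures this piece of information or not.\n"
    "# Gold Information: {n_g}\n"
    "# Response: {R}\n"
    "# Please answer the following:\n"
    "Does the Response capture the Gold Information? Only respond "
    'with "yes" or "no" without further explanation.\n'
    "# Answer (yes/no):"
)


def ntr(R: str, n_g: str, llm: Callable[[str], str]) -> int:
    """Return s = 1 if the LLM says response R captures gold nugget n_g, else 0."""
    answer = llm(NTR_PROMPT.format(n_g=n_g, R=R))
    # Assumption: anything that does not start with "yes" counts as "no".
    return 1 if answer.strip().lower().startswith("yes") else 0


def ntr_recall(R: str, gold: List[str], llm: Callable[[str], str]) -> float:
    """Fraction of gold nuggets covered by R, one LLM call per gold nugget."""
    if not gold:
        return 0.0
    return sum(ntr(R, n_g, llm) for n_g in gold) / len(gold)
```

Because no nuggets are extracted from the response, there is no denominator for precision, which is why this variant yields recall only.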
4.3
Nugget Duplicate Removal
To ensure a more concise and non-redundant set of nuggets for
evaluation, we perform de-duplication by removing gold nuggets (or
extracted nuggets) that are entailed by another nugget within
the same set. This process is applied separately to the
human-annotated gold nuggets and the LLM-generated gold nuggets. For a
given set of gold nuggets $\mathcal{N}_G$, we filter out any nugget $n_g \in \mathcal{N}_G$ if
there exists another nugget $n'_g \in \mathcal{N}_G \setminus \{n_g\}$ such that $E(n'_g, n_g) = 1$,
where $E(n'_g, n_g) = 1$ indicates that $n'_g$ entails $n_g$. Any nugget that is fully
covered by another is removed, ensuring that only the most
informative and distinct nuggets remain. By eliminating redundant nuggets
within each source, this de-duplication step refines the evaluation
set and prevents overestimation of recall.
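The filtering rule can be sketched as follows. One detail is an assumption of this sketch: when two nuggets entail each other, the rule as stated would remove both, so the sketch keeps the one that appears first. The `entails` argument is again a hypothetical stand-in for the NLI model.

```python
from typing import Callable, List


def deduplicate(nuggets: List[str], entails: Callable[[str, str], bool]) -> List[str]:
    """Drop every nugget that is entailed by another nugget in the same set."""
    kept: List[str] = []
    for i, nugget in enumerate(nuggets):
        redundant = False
        for j, other in enumerate(nuggets):
            if i == j:
                continue
            if entails(other, nugget):
                # Assumption: for mutually entailing nuggets, keep the earlier one.
                if entails(nugget, other) and i < j:
                    continue
                redundant = True
                break
        if not redundant:
            kept.append(nugget)
    return kept
```

With the tie-break in place, a group of equivalent nuggets collapses to a single representative instead of disappearing from the gold set.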
5
Evaluation
In this section, we explain our evaluation method for three main
tasks of TREC iKAT 2024 including response generation, passage
retrieval, and PTKB statement classification.
5.1
Experimental Setup
Our framework uses different models in the nugget extraction
and nugget matching components. For nugget extraction,
we rely on GPT4o [26] as the base LLM, which we prompt in a
zero-shot manner. For the NtN matching, we employ an NLI DeBERTa [19]
model, MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli. For
NtR, we compare LLM and NLI models, namely
GPT4o and DeBERTa. Duplicate removal is also
performed with the same NLI DeBERTa model. We rely on the BERGEN
GitHub repository for computing the surface-based metrics.7

Table 5: Correlation between rankings of responses using gold
nuggets ($\mathcal{N}_G$) extracted by the LLM and by humans for the NtN
matching approach. Nugget sets after removing duplicates are
marked with [D]. The correlation is reported using Kendall's
$\tau$ and Spearman's $\rho$.
N_G          N_G          Precision_NtN      Recall_NtN
                          τ        ρ         τ        ρ
Human        Human [D]    0.591    0.737     0.848    0.956
LLM          LLM [D]      0.731    0.882     0.801    0.918
Human        LLM          0.649    0.814     0.731    0.889
Human [D]    LLM [D]      0.649    0.853     0.661    0.832
Human        LLM [D]      0.637    0.772     0.626    0.805
Human [D]    LLM          0.567    0.746     0.719    0.879
5.2
Proposed RAG Evaluation Pipeline
Nugget extraction. We assess the performance of this component
in two ways.
(1) First, we apply the nugget extraction model to the same set
of relevant passages that the human assessors used for extracting
nuggets. In this way, we create a new set of gold nuggets
extracted by the LLM, which we compare with the gold nuggets
annotated by humans.
(2) Second, we apply the matching approaches using two different
sources of gold nuggets: (a) gold nuggets from humans and
(b) gold nuggets extracted by the LLM (see Section 3). We rank the
participating runs based on their performance under each set of
gold nuggets and compute the correlation between the two
rankings.
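The ranking comparison in step (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the paper: the function names are hypothetical, Kendall's τ is computed in its simple tau-a form (tied pairs count as neither concordant nor discordant), and the four example scores are the human and LLM RecallNtN values of four runs taken from Table 9, so the resulting numbers illustrate only the computation, not the correlations reported for the full set of runs.

```python
from itertools import combinations


def average_ranks(scores):
    """Map each run to its rank (1 = best); tied runs share the mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for run in ordered[i:j + 1]:
            ranks[run] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kendall_tau(a, b):
    """Kendall's tau-a over the runs scored in both dictionaries."""
    runs = sorted(set(a) & set(b))
    concordant = discordant = 0
    for x, y in combinations(runs, 2):
        sign = (a[x] - a[y]) * (b[x] - b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(runs) * (len(runs) - 1) / 2
    return (concordant - discordant) / n_pairs


def spearman_rho(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Assumes neither score list is constant (otherwise the variance is zero).
    """
    runs = sorted(set(a) & set(b))
    ra = average_ranks({r: a[r] for r in runs})
    rb = average_ranks({r: b[r] for r in runs})
    mean_a = sum(ra.values()) / len(runs)
    mean_b = sum(rb.values()) / len(runs)
    cov = sum((ra[r] - mean_a) * (rb[r] - mean_b) for r in runs)
    var_a = sum((ra[r] - mean_a) ** 2 for r in runs)
    var_b = sum((rb[r] - mean_b) ** 2 for r in runs)
    return cov / (var_a * var_b) ** 0.5


# RecallNtN of four runs under human and LLM gold nuggets (subset of Table 9).
recall_human_gold = {"uva-3": 0.138, "orga-3": 0.128, "uva-2": 0.125, "ksu-1": 0.018}
recall_llm_gold = {"uva-3": 0.120, "orga-3": 0.121, "uva-2": 0.141, "ksu-1": 0.007}

if __name__ == "__main__":
    print(kendall_tau(recall_human_gold, recall_llm_gold))
    print(spearman_rho(recall_human_gold, recall_llm_gold))
```

In practice a statistics library would normally be used for these coefficients; the hand-written version is shown only to make the definition of each coefficient explicit.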
Nugget matching. We run a human study on Amazon Mechanical Turk
(MTurk) for the NtR matching task. We show the annotators a response
together with a gold nugget and ask them to answer the following
question with “Yes” or “No”: “Is the gold nugget of information
covered in the response?” We designed a comprehensive guideline for
the task. To ensure the quality of the annotations, (1) we placed a
test question at a random location in the study and discarded the
annotations of workers who answered it incorrectly, and (2) we
restricted participation to annotators who have an approval rate
above 98%, have successfully completed more than 5,000 HITs, and are
from the UK, the US, or Australia.
We randomly selected 25 turns from the 79 turns with manual
nuggets and picked the best run of each participating team based
on passage retrieval performance. In total, we had 1,356
response–nugget pairs, which we divided into 136 batches of 10
response–nugget pairs each. We included one additional test question
in each batch and assigned each batch to 3 different annotators. We
take the majority vote of the 3 annotators as the final answer. We
denote by N′_P[H] the set of gold nuggets that the human annotators
judged to be matched by the input response. We compute the agreement
between the human assessors and our NtR model by comparing N′_P
and N′_P[H].
7 https://github.com/naver/bergen

Table 6: Agreement between human and NtR matching.

| Model | Accuracy | Cohen's κ |
|---|---|---|
| DeBERTa | 0.805 | 0.247 |
| GPT-4o | 0.90 | 0.610 |
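The aggregation and agreement steps described above (majority vote over three annotators, then accuracy and Cohen's κ between the human labels and the NtR model) can be sketched as follows. The function names and the toy labels are hypothetical and serve only to illustrate the computation; they are not the study's data.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; with 3 binary votes there is never a tie."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(human, model):
    """Fraction of response-nugget pairs on which the two label lists agree."""
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohen_kappa(human, model):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_observed = accuracy(human, model)
    human_freq, model_freq = Counter(human), Counter(model)
    p_chance = sum(human_freq[c] * model_freq[c]
                   for c in set(human) | set(model)) / n ** 2
    if p_chance == 1.0:
        return 1.0  # assumption: both raters used a single identical label
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical annotations: 1 = "Yes" (nugget covered), 0 = "No".
votes_per_pair = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0]]
human_labels = [majority_vote(v) for v in votes_per_pair]
model_labels = [1, 0, 1, 1, 1, 0]  # hypothetical NtR model decisions

if __name__ == "__main__":
    print(accuracy(human_labels, model_labels))
    print(cohen_kappa(human_labels, model_labels))
```

Because κ subtracts the agreement expected by chance, a matcher can reach a fairly high accuracy and still obtain a low κ when one label dominates, which is one possible reading of the gap between the two columns in Table 6.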
End-to-end matching. We evaluate the end-to-end performance
of our proposed pipeline when the NtR matching model is used.
We evaluate the participating runs with both N′_P and N′_P[H]
(see footnote 8), rank the runs under each, and compute the rank
correlation between the two rankings.
5.3 Response Generation
To evaluate the quality of the responses generated by participating
teams, we use two evaluation paradigms: (1) reference-based metrics
(surface-based, semantic-based, and LLM-based) and (2) our proposed
nugget-based RAG evaluation pipeline.
Reference-based evaluation. We employ surface-based, semantic-based,
and LLM-based metrics that compare the input response against
our gold response as a reference. We report Rouge-1, Rouge-2,
Rouge-L, BEM [11], and LLMEval with both GPT-4o [20] and
SOLAR-10.7B-Instruct-v1.0 [22]. In addition, we report the
groundedness metric [5], which measures to what extent the input
response is grounded in the top passages returned by the passage
retrieval model.
Nugget-based RAG evaluation pipeline. We apply our proposed
RAG evaluation framework to obtain the set of gold nuggets
covered by the generated response. Below, we explain how recall
and precision are calculated for the two matching approaches.
• Nugget to Nugget. We compute the nugget recall and precision
in this approach using the following equations.
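$$\mathrm{Recall}_{\mathrm{NtN}} = \frac{|\mathcal{N}'_G|}{|\mathcal{N}_G|} = \frac{|\{n_g \in \mathcal{N}_G \mid \exists\, n_p \in \mathcal{N}_P,\ \mathrm{NtN}(n_p, n_g) = 1\}|}{|\mathcal{N}_G|} \tag{4}$$
$$\mathrm{Precision}_{\mathrm{NtN}} = \frac{|\mathcal{N}'_P|}{|\mathcal{N}_P|} = \frac{|\{n_p \in \mathcal{N}_P \mid \exists\, n_g \in \mathcal{N}_G,\ \mathrm{NtN}(n_p, n_g) = 1\}|}{|\mathcal{N}_P|} \tag{5}$$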
The former, RecallNtN, measures the proportion of gold nuggets
that are entailed by at least one extracted nugget. The latter,
PrecisionNtN, measures the proportion of extracted nuggets that
entail at least one gold nugget.
• Nugget to response. We define the nugget recall in this approach
as follows.
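$$\mathrm{Recall}_{\mathrm{NtR}} = \frac{|\mathcal{N}''_G|}{|\mathcal{N}_G|} = \frac{|\{n_g \in \mathcal{N}_G \mid \mathrm{NtR}(R, n_g) = 1\}|}{|\mathcal{N}_G|} \tag{6}$$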
8 Note that the human gold nuggets are used for the NtR matching.

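Table 7: Rank correlation between systems using NtN and
NtR matching models.

| Metric | N_G | PrecisionNtN τ | PrecisionNtN ρ | RecallNtN τ | RecallNtN ρ |
|---|---|---|---|---|---|
| RecallNtR | Human | 0.614 | 0.786 | 0.778 | 0.923 |
| RecallNtR | LLM | 0.626 | 0.781 | 0.778 | 0.925 |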
This metric evaluates recall by considering whether the full re-
sponse directly supports the gold nuggets, independent of explicit
nugget extraction.
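The three nugget metrics of Eqs. (4) to (6) reduce to simple set counts once a matcher is available. The sketch below assumes the matcher is passed in as a function `(premise, hypothesis) -> bool`; in the paper that role is played by an NLI model or an LLM, whereas here a hypothetical word-subset test (`toy_match`) stands in so the example runs without any model. The function names, the treatment of empty sets, and the example nuggets are all illustrative assumptions.

```python
from typing import Callable, Sequence

# (premise, hypothesis) -> True if the premise is judged to entail the hypothesis.
Matcher = Callable[[str, str], bool]


def recall_ntn(gold: Sequence[str], extracted: Sequence[str], ntn: Matcher) -> float:
    """Eq. (4): share of gold nuggets entailed by at least one extracted nugget."""
    if not gold:
        return 0.0  # assumption: an empty gold set scores 0
    return sum(1 for g in gold if any(ntn(p, g) for p in extracted)) / len(gold)


def precision_ntn(gold: Sequence[str], extracted: Sequence[str], ntn: Matcher) -> float:
    """Eq. (5): share of extracted nuggets that entail at least one gold nugget."""
    if not extracted:
        return 0.0  # assumption: a response with no nuggets scores 0
    return sum(1 for p in extracted if any(ntn(p, g) for g in gold)) / len(extracted)


def recall_ntr(gold: Sequence[str], response: str, ntr: Matcher) -> float:
    """Eq. (6): share of gold nuggets supported directly by the full response."""
    if not gold:
        return 0.0
    return sum(1 for g in gold if ntr(response, g)) / len(gold)


def toy_match(premise: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for NLI/LLM matching: every hypothesis word is in the premise."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


gold_nuggets = ["paris is the capital of france", "the eiffel tower is in paris"]
extracted_nuggets = ["paris is the capital of france and its largest city",
                     "france uses the euro"]
response_text = ("paris is the capital of france and its largest city and "
                 "the eiffel tower is in paris while france uses the euro")

if __name__ == "__main__":
    print(recall_ntn(gold_nuggets, extracted_nuggets, toy_match))
    print(precision_ntn(gold_nuggets, extracted_nuggets, toy_match))
    print(recall_ntr(gold_nuggets, response_text, toy_match))
```

In this toy case the nugget extractor missed one fact that is present in the response, so the NtR recall exceeds the NtN recall; this is one way a gap between the two recall variants can arise.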
5.4 Passage Ranking
We assess the submitted runs at different cutoffs using metrics
such as precision, recall, and nDCG. The primary evaluation metric is
mean nDCG@5, computed as the unweighted average over all
conversational turns.
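As an illustration of the primary metric, the following sketch computes nDCG@5 per turn and then averages over turns with equal weight. It assumes graded relevance labels used directly as gains with a log2 rank discount (some implementations use 2^rel − 1 as the gain instead), and it assumes that a turn without a ranking or without relevant passages contributes 0. The function names and the toy judgments are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, qrels, k=5):
    """nDCG@k for one turn; qrels maps passage id -> graded relevance."""
    gains = [qrels.get(pid, 0) for pid in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def mean_ndcg(run, qrels_by_turn, k=5):
    """Unweighted mean over all judged turns; turns missing from the run score 0."""
    scores = [ndcg_at_k(run.get(turn, []), qrels, k)
              for turn, qrels in qrels_by_turn.items()]
    return sum(scores) / len(scores)


# Hypothetical judgments and rankings for two turns.
toy_qrels = {"t1": {"p1": 2, "p2": 1}, "t2": {"p9": 1}}
toy_run = {"t1": ["p1", "p2"], "t2": ["x"]}

if __name__ == "__main__":
    print(ndcg_at_k(["p2", "p1"], toy_qrels["t1"]))
    print(mean_ndcg(toy_run, toy_qrels))
```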
6 Results & Analysis
6.1 RAG evaluation: Comparison with human performance
LLM-based nugget extraction. Human assessors extracted 2,279
nuggets, while our LLM-based model extracted 6,680 nuggets. Human
nuggets are around 32 words long on average, whereas LLM nuggets
are around 14 words long. The LLM-based nuggetizer thus tends to
extract more fine-grained and specific nuggets.
Among the 6,680 nuggets extracted by the LLM, 144 exactly
match and 4,189 partially match human nuggets, while
2,491 are new nuggets. Among the partially matching nuggets,
2,443 overlap with human nuggets over at least two words, 1,746
overlap over one or two words, and 2,159 overlap
over more than 4 words.
Rank correlation using LLM and human nuggets. We evaluate
the participating systems using CONE-RAG and rank them, then
repeat the process with a different set of gold nuggets. Table 5
reports the correlation between the rankings of participating
systems under different sets of gold nuggets. Without removing
duplicate nuggets, the correlation between the rankings based on
nugget recall with human and with LLM gold nuggets is 0.731
(Kendall's τ) and 0.889 (Spearman's ρ).
LLM vs. human for NtR matching. Table 6 reports the agreement
between humans and our NtR model. GPT-4o outperforms DeBERTa
on the NtR matching task, achieving a Cohen's κ of 0.61 and an
accuracy of 0.90.
Human vs. LLM for RecallNtR. We use our end-to-end framework
(CONE-RAG) to calculate RecallNtR for the runs and rank them.
We also calculate RecallNtR using the human NtR matching
judgments. Compared to the human-based ranking, CONE-RAG
(based on GPT-4o) achieves a rank correlation of 0.867 (Kendall's τ)
and 0.943 (Spearman's ρ). Using DeBERTa for NtR matching, the
corresponding values are 0.6 and 0.829.
[Figure 1 (scatter plot): RecallNtN (axis range roughly 0.02 to 0.14) plotted against nDCG@5 (axis range roughly 0.1 to 0.5), one point per automatic run: iires-1, orga-2, orga-1, uva-4, infos-4, orga-6, ksu-1, orga-4, uva-1, rali-2, uva-2, infos-1, orga-5, nii-1, orga-3, rali-3, infos-2, infos-3, uva-3.]
Figure 1: Nugget Recall vs. nDCG@5 for automatic runs.
NtN vs. NtR matching. Table 7 shows the correlation between the
rankings of systems under the two matching methods in CONE-RAG.
We obtain a Kendall's τ of 0.778 and a Spearman's ρ of 0.923
between the rankings by RecallNtN and RecallNtR when using
human gold nuggets. Moreover, the correlation is approximately
equal whether the original gold nuggets come from the LLM or from
humans. This observation shows the robustness of the matching
models to the input set of gold nuggets.
Nugget-based evaluation vs. reference-based evaluation. Table 8
reports the rank correlation between the rankings of systems under
CONE-RAG and under reference-based metrics. Surprisingly, we
observe a very low correlation with ROUGE, although this
metric is also recall-oriented, being based on n-gram overlap
between the predicted response and the human reference response.
This highlights the stark contrast between the two evaluation
paradigms, each of which focuses on distinctly different aspects.
Reference-based metrics like ROUGE only consider surface (n-gram)
similarity to the reference, whereas nugget-based evaluation
compares the nuggets of information contained in the response.
Interestingly, the correlation of nugget recall and precision with
groundedness is negative. A response can have high nugget recall
even when its nuggets of information come from the intrinsic
knowledge of the response generation model rather than from the
input passages. CONE-RAG has a higher rank correlation with the
LLMEval and BEM metrics.
Original nuggets vs. de-duplicated nuggets. Table 5 shows the
correlation between the rankings of systems using the original gold
nuggets and the gold nuggets after duplicate removal. Removing
duplicate nuggets lowers the correlation more for precision than
for recall. For example, after removing the duplicates
from the original human gold nuggets, Kendall's τ is
0.591 for precision and 0.848 for recall.
Comparison between response generation performance of
participating teams. We report the performance of participating
teams in Table 9. With the NtR matching model, recall is much
higher than with the NtN matching model. This suggests that the
NtR model matches the full response to a larger number of gold
nuggets, whereas the NtN model matches only the smaller number of
nuggets extracted from the response to the gold nuggets.
Switching between LLM and human gold nuggets changes the
recall values only slightly across

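Table 8: Rank correlation between CONE-RAG and different reference-based evaluation metrics.

| Matching | N_G | Rouge-1 τ | Rouge-1 ρ | Rouge-2 τ | Rouge-2 ρ | Rouge-L τ | Rouge-L ρ | LLMeval (SOLAR) τ | LLMeval (SOLAR) ρ | LLMeval (GPT-4o) τ | LLMeval (GPT-4o) ρ | BEM τ | BEM ρ | Groundedness τ | Groundedness ρ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RecallNtN | LLM | 0.147 | 0.184 | 0.074 | 0.074 | 0.088 | 0.113 | 0.574 | 0.735 | 0.735 | 0.9 | 0.574 | 0.75 | -0.397 | -0.525 |
| RecallNtN | Human | -0.088 | -0.11 | -0.103 | -0.123 | -0.088 | -0.13 | 0.603 | 0.775 | 0.441 | 0.6 | 0.691 | 0.86 | -0.25 | -0.277 |
| RecallNtR | Human | -0.1 | -0.088 | -0.033 | -0.103 | -0.183 | -0.165 | 0.717 | 0.85 | 0.633 | 0.829 | 0.567 | 0.724 | -0.333 | -0.353 |
| RecallNtR | LLM | -0.02 | -0.007 | -0.033 | -0.026 | -0.098 | -0.067 | 0.725 | 0.874 | 0.66 | 0.822 | 0.569 | 0.773 | -0.399 | -0.476 |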
Table 9: Response generation performance of the participants over different metrics.

| Run name | NtN Human Precision | NtN Human Recall | NtN LLM Precision | NtN LLM Recall | NtR Human Recall | NtR LLM Recall | Rouge-L | LLMeval SOLAR | LLMeval GPT-4o | BEM | Groundedness |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Automatic | | | | | | | | | | | |
| uva-3 | 0.454 | 0.138 | 0.504 | 0.120 | 0.368 | 0.379 | 0.191 | 0.952 | 0.629 | 0.252 | 0.387 |
| orga-3 | 0.470 | 0.128 | 0.532 | 0.121 | 0.368 | 0.379 | 0.200 | 0.984 | 0.629 | 0.283 | 0.468 |
| uva-2 | 0.458 | 0.125 | 0.565 | 0.141 | 0.339 | 0.390 | 0.199 | 0.935 | 0.79 | 0.265 | 0.435 |
| uva-1 | 0.461 | 0.111 | 0.568 | 0.146 | 0.367 | 0.364 | 0.199 | 0.935 | 0.71 | 0.269 | 0.355 |
| uva-4 | 0.449 | 0.110 | 0.543 | 0.130 | 0.395 | 0.425 | 0.197 | 0.984 | 0.758 | 0.272 | 0.339 |
| nii-1 | 0.460 | 0.110 | 0.540 | 0.129 | 0.364 | 0.391 | 0.202 | 0.952 | 0.677 | 0.263 | 0.871 |
| orga-5 | 0.424 | 0.108 | 0.504 | 0.115 | 0.342 | 0.354 | 0.197 | 0.968 | 0.710 | 0.287 | 0.548 |
| orga-6 | 0.433 | 0.107 | 0.496 | 0.123 | 0.346 | 0.357 | 0.193 | 0.919 | 0.645 | 0.253 | 0.484 |
| orga-1 | 0.457 | 0.104 | 0.465 | 0.098 | 0.358 | 0.325 | 0.184 | 0.823 | 0.548 | 0.267 | 0.710 |
| rali-3 | 0.423 | 0.095 | 0.532 | 0.117 | 0.276 | 0.332 | 0.214 | 0.887 | 0.710 | 0.235 | 0.613 |
| orga-4 | 0.436 | 0.092 | 0.507 | 0.107 | 0.301 | 0.333 | 0.195 | 0.935 | 0.597 | 0.267 | 0.532 |
| rali-2 | 0.422 | 0.088 | 0.493 | 0.115 | 0.246 | 0.313 | 0.222 | 0.919 | 0.645 | 0.246 | 0.565 |
| orga-2 | 0.378 | 0.076 | 0.444 | 0.101 | 0.273 | 0.244 | 0.198 | 0.613 | 0.419 | 0.209 | 0.677 |
| infos-2 | 0.443 | 0.067 | 0.540 | 0.107 | 0.149 | 0.180 | 0.237 | 0.803 | 0.645 | 0.253 | 0.097 |
| infos-4 | 0.430 | 0.066 | 0.480 | 0.058 | 0.163 | 0.176 | 0.218 | 0.629 | 0.403 | 0.213 | 0.565 |
| infos-1 | 0.378 | 0.061 | 0.450 | 0.068 | 0.155 | 0.169 | 0.228 | 0.787 | 0.581 | 0.225 | 0.290 |
| infos-3 | 0.417 | 0.059 | 0.434 | 0.062 | 0.171 | 0.184 | 0.217 | 0.661 | 0.306 | 0.227 | 0.645 |
| iires-1 | 0.378 | 0.022 | 0.325 | 0.020 | 0.048 | 0.048 | 0.091 | 0.016 | 0.032 | 0.122 | 0.855 |
| ksu-1 | 0.293 | 0.018 | 0.178 | 0.007 | 0.043 | 0.041 | 0.143 | 0.065 | 0.065 | 0.148 | 0.750 |
| Manual | | | | | | | | | | | |
| orga-8-m | 0.482 | 0.115 | 0.525 | 0.124 | 0.366 | 0.337 | 0.195 | 0.967 | 0.661 | 0.268 | 0.516 |
| uva-6-m | 0.512 | 0.149 | 0.556 | 0.144 | 0.393 | 0.413 | 0.198 | 0.983 | 0.790 | 0.283 | 0.419 |
| orga-7-m | 0.477 | 0.126 | 0.539 | 0.143 | 0.383 | 0.393 | 0.199 | 0.935 | 0.709 | 0.266 | 0.435 |
| uva-5-m | 0.470 | 0.127 | 0.563 | 0.141 | 0.404 | 0.404 | 0.195 | 1.000 | 0.725 | 0.247 | 0.435 |
| Generation-only | | | | | | | | | | | |
| nii-gen-only | 0.482 | 0.051 | 0.468 | 0.062 | 0.237 | 0.258 | 0.174 | 0.580 | 0.435 | 0.203 | 0.919 |
different teams. However, for each team the difference in precision
is somewhat larger than the difference in recall. This is in line
with the rank correlations in Table 5, where the correlation between
rankings based on LLM and human gold nuggets is higher for recall
than for precision. Interestingly, runs with higher nugget recall
and precision tend to have lower groundedness, while runs with lower
nugget recall and precision tend to have higher groundedness.
For example, the “iires-1” run has a nugget recall of 0.022 and a
groundedness of 0.855, whereas the “uva-3” run has a nugget recall
of 0.138 and a groundedness of 0.387.
Performance on personal turns. We classify the turns as personal
or non-personal, where a turn with at least one relevant
PTKB statement is considered personal. As Table 10 shows,
nugget precision and recall are generally lower on personal
turns than on non-personal turns.
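The split into personal and non-personal turns can be sketched as a simple grouping step. This is only an illustration under assumed data structures: `turn_scores` maps a turn id to a per-turn metric value (for example RecallNtN), and `relevant_ptkb` maps a turn id to the list of PTKB statements judged relevant for that turn. All names and values below are hypothetical.

```python
from statistics import mean


def average_by_turn_type(turn_scores, relevant_ptkb):
    """Average a per-turn metric separately over personal and non-personal turns.

    A turn counts as personal when it has at least one relevant PTKB statement.
    """
    groups = {"personal": [], "non_personal": []}
    for turn_id, score in turn_scores.items():
        key = "personal" if relevant_ptkb.get(turn_id) else "non_personal"
        groups[key].append(score)
    # An empty group has no defined average, so report None for it.
    return {name: (mean(values) if values else None)
            for name, values in groups.items()}


# Hypothetical per-turn recall values and relevant PTKB statement ids.
toy_turn_scores = {"1-1": 0.05, "1-2": 0.15, "2-1": 0.10, "2-2": 0.07}
toy_relevant_ptkb = {"1-1": [3], "1-2": [], "2-2": [1, 4]}

if __name__ == "__main__":
    print(average_by_turn_type(toy_turn_scores, toy_relevant_ptkb))
```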
6.2 Retrieval performance evaluation
Overall results. Table 11 shows the retrieval performance of the
automatic and manual runs. The gap between the performance of
manual and automatic runs is small, which shows

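Table 10: Average RecallNtN and PrecisionNtN of automatic
runs over personal and non-personal turns.

| Metric | N_G | Personal | Non-personal |
|---|---|---|---|
| RecallNtN | Human | 0.064 | 0.115 |
| RecallNtN | Human [D] | 0.021 | 0.049 |
| RecallNtN | LLM | 0.066 | 0.135 |
| RecallNtN | LLM [D] | 0.03 | 0.06 |
| PrecisionNtN | Human | 0.415 | 0.434 |
| PrecisionNtN | Human [D] | 0.041 | 0.081 |
| PrecisionNtN | LLM | 0.425 | 0.537 |
| PrecisionNtN | LLM [D] | 0.152 | 0.201 |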
Table 11: Automatic evaluation of passage retrieval results.
Evaluation at retrieval cutoff of 1000.

| Run | nDCG@5 | P@20 | R@20 | R@1000 | mAP |
|---|---|---|---|---|---|
| Automatic | | | | | |
| uva-4 | 0.494 | 0.596 | 0.176 | 0.783 | 0.355 |
| rali-2 | 0.484 | 0.478 | 0.148 | 0.580 | 0.198 |
| uva-2 | 0.481 | 0.549 | 0.166 | 0.644 | 0.297 |
| uva-1 | 0.488 | 0.583 | 0.172 | 0.774 | 0.334 |
| rali-3 | 0.471 | 0.471 | 0.140 | 0.564 | 0.189 |
| infos-2 | 0.466 | 0.516 | 0.148 | 0.682 | 0.256 |
| uva-3 | 0.461 | 0.506 | 0.153 | 0.486 | 0.219 |
| infos-1 | 0.445 | 0.472 | 0.141 | 0.639 | 0.235 |
| nii-1 | 0.405 | 0.437 | 0.138 | 0.598 | 0.218 |
| orga-3 | 0.385 | 0.381 | 0.131 | 0.427 | 0.176 |
| orga-6 | 0.377 | 0.416 | 0.126 | 0.420 | 0.171 |
| orga-5 | 0.374 | 0.397 | 0.134 | 0.603 | 0.213 |
| rali-1 | 0.367 | 0.387 | 0.121 | 0.580 | 0.170 |
| nii-2 | 0.366 | 0.394 | 0.119 | 0.519 | 0.197 |
| nii-3 | 0.365 | 0.392 | 0.119 | 0.519 | 0.197 |
| infos-4 | 0.353 | 0.401 | 0.114 | 0.570 | 0.190 |
| rali-4 | 0.347 | 0.375 | 0.112 | 0.564 | 0.161 |
| orga-2 | 0.33 | 0.404 | 0.119 | 0.563 | 0.202 |
| infos-3 | 0.328 | 0.353 | 0.099 | 0.529 | 0.153 |
| orga-4 | 0.240 | 0.253 | 0.081 | 0.304 | 0.106 |
| orga-1 | 0.205 | 0.234 | 0.065 | 0.263 | 0.089 |
| ksu-1 | 0.164 | 0.071 | 0.022 | 0.022 | 0.017 |
| dcu-4 | 0.150 | 0.154 | 0.042 | 0.185 | 0.055 |
| Manual | | | | | |
| uva-6-m | 0.529 | 0.594 | 0.197 | 0.706 | 0.349 |
| uva-5-m | 0.473 | 0.547 | 0.183 | 0.706 | 0.305 |
| nii-5-m | 0.455 | 0.535 | 0.166 | 0.596 | 0.255 |
| nii-4-m | 0.454 | 0.535 | 0.165 | 0.596 | 0.256 |
| orga-7-m | 0.414 | 0.46 | 0.149 | 0.706 | 0.245 |
| orga-8-m | 0.413 | 0.435 | 0.143 | 0.460 | 0.19 |
| rali-6-m | 0.403 | 0.365 | 0.113 | 0.460 | 0.129 |
| rali-5-m | 0.396 | 0.365 | 0.113 | 0.460 | 0.132 |
| dcu-5-m | 0.226 | 0.222 | 0.080 | 0.221 | 0.080 |
| dcu-6-m | 0.207 | 0.211 | 0.073 | 0.217 | 0.072 |
the advancement of LLMs for context modeling and understanding
the information need of the user in the context of the conversation.
Table 12: Correlation between the ranking of systems based
on passage retrieval (measured by nDCG@5) and response
generation performance.

| Gold nuggets (N_G) | RecallNtN τ | RecallNtN ρ | PrecisionNtN τ | PrecisionNtN ρ |
|---|---|---|---|---|
| Human | 0.404 | 0.563 | 0.310 | 0.46 |
| Human [D] | 0.415 | 0.602 | 0.392 | 0.574 |
| LLM | 0.556 | 0.744 | 0.614 | 0.763 |
| LLM [D] | 0.661 | 0.837 | 0.556 | 0.725 |
For example, the best automatic and manual runs (“uva-4” and
“uva-6-m”, respectively) differ by only 0.035 in nDCG@5.
Correlation between passage retrieval and response generation
performance. As Table 12 shows, the correlation between
response generation and passage retrieval is higher when using
LLM gold nuggets than when using human nuggets.
In addition, removing duplicates from the human gold nuggets
increases the correlation for both nugget precision and nugget
recall, whereas removing duplicates from the LLM gold nuggets
increases the correlation only in terms of the nugget recall metric.
Overall, we do not observe a high correlation between nugget
recall/precision and nDCG@5. Based on this observation, we cannot
say that better passage retrieval performance guarantees higher
nugget recall/precision in response generation. This gap between
retrieval performance and response generation performance may have
multiple causes, such as retrieval models focusing on a single
aspect of the information need, or response generation models
relying on their intrinsic knowledge. Further in-depth analysis is
required to determine the main reason. Figure 1 plots the nugget
recall of the automatic runs (based on the NtN matching approach)
against nDCG@5. The “uva-X” runs are among the best in both
retrieval performance and nugget recall. These runs are based on
the MQ4CS model [1], which breaks the information need of the user
into multi-aspect queries. This observation indicates the
effectiveness of leveraging multi-aspect queries for generating a
more complete answer that covers a more diverse set of information.
In addition, these runs used the GPT-4 model for response generation.
7 Conclusion
We introduce the iKAT 2024 collection and resources, building
on the foundations established by TREC iKAT 2023. These
resources enable researchers to evaluate personalized conversational
search agents (CSA) on both passage retrieval and response
generation tasks. A key contribution of our work is CONE-RAG,
a nugget-based evaluation pipeline for retrieval-augmented
generation (RAG), along with an assessment pool for passage retrieval.
CONE-RAG enables the assessment of generated responses using
nugget recall and precision metrics. Extensive experiments compare
the framework's effectiveness against human performance and
indicate that it is robust and reliable. We also provide the
gold nuggets and the matching of nuggets with responses, which can be
used for the development of nugget extraction and nugget matching
models. In addition, we provide gold responses for each
conversational turn as a resource, which is essential for
surface-based evaluation.

References
[1] Zahra Abbasiantaeb and Mohammad Aliannejadi. 2024. Generate then Retrieve:
Conversational Response Retrieval Using LLMs as Answer and Query Generators.
arXiv:2403.19302 [cs.IR]
[2] Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, and Mohammad Aliannejadi.
2024. Can We Use Large Language Models to Fill Relevance Judgment Holes?
arXiv:2405.05600 [cs.IR] https://arxiv.org/abs/2405.05600
[3] Zahra Abbasiantaeb, Yifei Yuan, Evangelos Kanoulas, and Mohammad Alian-
nejadi. 2024. Let the LLMs Talk: Simulating Human-to-Human Conversational
QA via Zero-Shot LLM-to-LLM Interactions. In International Conference on Web
Search and Data Mining (WSDM). ACM, 8–17.
[4] Marwah Alaofi, Negar Arabzadeh, Charles LA Clarke, and Mark Sanderson. 2024.
Generative information retrieval evaluation. In Information Access in the Era of
Generative AI. Springer, 135–159.
[5] Mohammad Aliannejadi, Zahra Abbasiantaeb, Shubham Chatterjee, Jeffery Dal-
ton, and Leif Azzopardi. 2024. TREC iKAT 2023: The Interactive Knowledge
Assistance Track Overview. In Text REtrieval Conference (TREC). NIST.
[6] Mohammad Aliannejadi, Zahra Abbasiantaeb, Simon Lupart, Shubham Chatterjee,
Jeffery Dalton, and Leif Azzopardi. 2025. TREC iKAT 2025: The Interactive
Knowledge Assistance Track Overview. In Text REtrieval Conference (TREC).
NIST.
[7] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft.
2019. Asking Clarifying Questions in Open-Domain Information-Seeking Con-
versations. In SIGIR. ACM, 475–484.
[8] James Allan, Eunsol Choi, Daniel P Lopresti, and Hamed Zamani. 2024. Future
of Information Retrieval Research in the Age of Generative AI. arXiv preprint
arXiv:2412.02043 (2024).
[9] Leif Azzopardi, Mateusz Dubiel, Martin Halvey, and Jeffery Dalton. 2018. Con-
ceptualizing agent-human interactions during the conversational search process.
In The second international workshop on conversational approaches to information
retrieval.
[10] Nicholas J. Belkin. 2008. Some(what) grand challenges for information retrieval.
SIGIR Forum 42, 1 (jun 2008), 47–54. https://doi.org/10.1145/1394251.1394261
[11] Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin Boerschinger, and
Tal Schuster. 2022. Tomayto, tomahto. beyond token-level answer equivalence
for question answering evaluation. arXiv preprint arXiv:2202.07654 (2022).
[12] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2020. CAsT 2019: The conver-
sational assistance track overview. In Text REtrieval Conference (TREC). NIST.
[13] Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2021. TREC CAsT 2021: The
Conversational Assistance Track Overview. In Text REtrieval Conference (TREC).
NIST.
[14] Jeffrey Dalton, Chenyan Xiong, Vaibhav Kumar, and Jamie Callan. 2020. CAsT-
19: A Dataset for Conversational Information Seeking. In International ACM
SIGIR Conference on Research and Development in Information Retrieval. ACM,
1985–1988.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding.
arXiv:1810.04805 [cs.CL]
[16] Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. 2024. RAGAs:
Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the
18th Conference of the European Chapter of the Association for Computational
Linguistics: System Demonstrations, Nikolaos Aletras and Orphee De Clercq (Eds.).
Association for Computational Linguistics, St. Julians, Malta, 150–158.
https://aclanthology.org/2024.eacl-demo.16/
[17] Guglielmo Faggioli, Laura Dietz, Charles LA Clarke, Gianluca Demartini, Matthias
Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast,
Benno Stein, et al. 2023. Perspectives on large language models for relevance
judgment. In Proceedings of the 2023 ACM SIGIR International Conference on
Theory of Information Retrieval. 39–50.
[18] George Giannakopoulos and Vangelis Karkaletsis. 2013. Summary evaluation:
Together we stand npower-ed. In International Conference on Intelligent Text
Processing and Computational Linguistics. Springer, 436–450.
[19] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa:
Decoding-enhanced BERT with Disentangled Attention. In International
Conference on Learning Representations.
https://openreview.net/forum?id=XPZIaotutsD
[20] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh,
Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024.
Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024).
[21] Ehsan Kamalloo, Nouha Dziri, Charles Clarke, and Davood Rafiei. 2023. Evaluat-
ing Open-Domain Question Answering in the Era of Large Language Models. In
Proceedings of the 61st Annual Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki
Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada,
5591–5606. https://doi.org/10.18653/v1/2023.acl-long.307
[22] Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu
Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. 2023. Solar
10.7 b: Scaling large language models with simple yet effective depth up-scaling.
arXiv preprint arXiv:2312.15166 (2023).
[23] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries.
In Text Summarization Branches Out. Association for Computational Linguistics,
Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013/
[24] Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep,
and Rodrigo Nogueira. 2021. Pyserini: A Python Toolkit for Reproducible
Information Retrieval Research with Sparse and Dense Representations. In
Proceedings of the 44th International ACM SIGIR Conference on Research and
Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21).
Association for Computing Machinery, New York, NY, USA, 2356–2362.
https://doi.org/10.1145/3404835.3463238
[25] James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee,
Douglas W Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, et al.
2024. On the evaluation of machine-generated reports. In Proceedings of the 47th
International ACM SIGIR Conference on Research and Development in Information
Retrieval. 1904–1915.
[26] OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya
Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford,
Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Car-
ney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard
Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau, Ali Kamali, Allan
Jabri, Allison Moyer, Allison Tam, Amadou Crookes, Amin Tootoochian, Amin
Tootoonchian, Ananya Kumar, Andrea Vallone, Andrej Karpathy, Andrew Braun-
stein, Andrew Cann, Andrew Codispoti, Andrew Galu, Andrew Kondrich, An-
drew Tulloch, Andrey Mishchenko, Angela Baek, Angela Jiang, Antoine Pelisse,
Antonia Woodford, Anuj Gosalia, Arka Dhar, Ashley Pantuliano, Avi Nayak,
Avital Oliver, Barret Zoph, Behrooz Ghorbani, Ben Leimberger, Ben Rossen, Ben
Sokolowsky, Ben Wang, Benjamin Zweig, Beth Hoover, Blake Samic, Bob Mc-
Grew, Bobby Spero, Bogo Giertler, Bowen Cheng, Brad Lightcap, Brandon Walkin,
Brendan Quinn, Brian Guarraci, Brian Hsu, Bright Kellogg, Brydon Eastman,
Camillo Lugaresi, Carroll Wainwright, Cary Bassin, Cary Hudson, Casey Chu,
Chad Nelson, Chak Li, Chan Jun Shern, Channing Conger, Charlotte Barette,
Chelsea Voss, Chen Ding, Cheng Lu, Chong Zhang, Chris Beaumont, Chris
Hallacy, Chris Koch, Christian Gibson, Christina Kim, Christine Choi, Christine
McLeavey, Christopher Hesse, Claudia Fischer, Clemens Winter, Coley Czarnecki,
Colin Jarvis, Colin Wei, Constantin Koumouzelis, Dane Sherburn, Daniel Kappler,
Daniel Levin, Daniel Levy, David Carr, David Farhi, David Mely, David Robinson,
David Sasaki, Denny Jin, Dev Valladares, Dimitris Tsipras, Doug Li, Duc Phong
Nguyen, Duncan Findlay, Edede Oiwoh, Edmund Wong, Ehsan Asdar, Elizabeth
Proehl, Elizabeth Yang, Eric Antonow, Eric Kramer, Eric Peterson, Eric Sigler,
Eric Wallace, Eugene Brevdo, Evan Mays, Farzad Khorasani, Felipe Petroski Such,
Filippo Raso, Francis Zhang, Fred von Lohmann, Freddie Sulit, Gabriel Goh, Gene
Oden, Geoff Salmon, Giulio Starace, Greg Brockman, Hadi Salman, Haiming Bao,
Haitang Hu, Hannah Wong, Haoyu Wang, Heather Schmidt, Heather Whitney,
Heewoo Jun, Hendrik Kirchner, Henrique Ponde de Oliveira Pinto, Hongyu Ren,
Huiwen Chang, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian O’Connell,
Ian Osband, Ian Silber, Ian Sohl, Ibrahim Okuyucu, Ikai Lan, Ilya Kostrikov, Ilya
Sutskever, Ingmar Kanitscheider, Ishaan Gulrajani, Jacob Coxon, Jacob Menick,
Jakub Pachocki, James Aung, James Betker, James Crooks, James Lennon, Jamie
Kiros, Jan Leike, Jane Park, Jason Kwon, Jason Phang, Jason Teplitz, Jason Wei,
Jason Wolfe, Jay Chen, Jeff Harris, Jenia Varavva, Jessica Gan Lee, Jessica Shieh,
Ji Lin, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joanne Jang, Joaquin Quinonero
Candela, Joe Beutler, Joe Landers, Joel Parish, Johannes Heidecke, John Schul-
man, Jonathan Lachman, Jonathan McKay, Jonathan Uesato, Jonathan Ward,
Jong Wook Kim, Joost Huizinga, Jordan Sitkin, Jos Kraaijeveld, Josh Gross, Josh
Kaplan, Josh Snyder, Joshua Achiam, Joy Jiao, Joyce Lee, Juntang Zhuang, Justyn
Harriman, Kai Fricke, Kai Hayashi, Karan Singhal, Katy Shi, Kavin Karthik, Kayla
Wood, Kendra Rimbach, Kenny Hsu, Kenny Nguyen, Keren Gu-Lemberg, Kevin
Button, Kevin Liu, Kiel Howe, Krithika Muthukumar, Kyle Luther, Lama Ahmad,
Larry Kai, Lauren Itow, Lauren Workman, Leher Pathak, Leo Chen, Li Jing, Lia
Guy, Liam Fedus, Liang Zhou, Lien Mamitsuka, Lilian Weng, Lindsay McCallum,
Lindsey Held, Long Ouyang, Louis Feuvrier, Lu Zhang, Lukas Kondraciuk, Lukasz
Kaiser, Luke Hewitt, Luke Metz, Lyric Doshi, Mada Aflak, Maddie Simens, Made-
laine Boyd, Madeleine Thompson, Marat Dukhan, Mark Chen, Mark Gray, Mark
Hudnall, Marvin Zhang, Marwan Aljubeh, Mateusz Litwin, Matthew Zeng, Max
Johnson, Maya Shetty, Mayank Gupta, Meghan Shah, Mehmet Yatbaz, Meng Jia
Yang, Mengchao Zhong, Mia Glaese, Mianna Chen, Michael Janner, Michael
Lampe, Michael Petrov, Michael Wu, Michele Wang, Michelle Fradin, Michelle
Pokrass, Miguel Castro, Miguel Oom Temudo de Castro, Mikhail Pavlov, Miles
Brundage, Miles Wang, Minal Khan, Mira Murati, Mo Bavarian, Molly Lin, Mu-
rat Yesildal, Nacho Soto, Natalia Gimelshein, Natalie Cone, Natalie Staudacher,
Natalie Summers, Natan LaFontaine, Neil Chowdhury, Nick Ryder, Nick Stathas,
Nick Turley, Nik Tezak, Niko Felix, Nithanth Kudige, Nitish Keskar, Noah Deutsch,
Noel Bundick, Nora Puckett, Ofir Nachum, Ola Okelola, Oleg Boiko, Oleg Murk,
Oliver Jaffe, Olivia Watkins, Olivier Godement, Owen Campbell-Moore, Patrick
Chao, Paul McMillan, Pavel Belov, Peng Su, Peter Bak, Peter Bakkum, Peter Deng,
Peter Dolan, Peter Hoeschele, Peter Welinder, Phil Tillet, Philip Pronin, Philippe

Tillet, Prafulla Dhariwal, Qiming Yuan, Rachel Dias, Rachel Lim, Rahul Arora,
Rajan Troll, Randall Lin, Rapha Gontijo Lopes, Raul Puri, Reah Miyara, Reimar
Leike, Renaud Gaubert, Reza Zamani, Ricky Wang, Rob Donnelly, Rob Honsby,
Rocky Smith, Rohan Sahai, Rohit Ramchandani, Romain Huet, Rory Carmichael,
Rowan Zellers, Roy Chen, Ruby Chen, Ruslan Nigmatullin, Ryan Cheu, Saachi
Jain, Sam Altman, Sam Schoenholz, Sam Toizer, Samuel Miserendino, Sandhini
Agarwal, Sara Culver, Scott Ethersmith, Scott Gray, Sean Grove, Sean Metzger,
Shamez Hermani, Shantanu Jain, Shengjia Zhao, Sherwin Wu, Shino Jomoto,
Shirong Wu, Shuaiqi, Xia, Sonia Phene, Spencer Papay, Srinivas Narayanan, Steve
Coffey, Steve Lee, Stewart Hall, Suchir Balaji, Tal Broda, Tal Stramer, Tao Xu,
Tarun Gogineni, Taya Christianson, Ted Sanders, Tejal Patwardhan, Thomas
Cunninghman, Thomas Degry, Thomas Dimson, Thomas Raoux, Thomas Shad-
well, Tianhao Zheng, Todd Underwood, Todor Markov, Toki Sherbakov, Tom
Rubin, Tom Stasi, Tomer Kaftan, Tristan Heywood, Troy Peterson, Tyce Wal-
ters, Tyna Eloundou, Valerie Qi, Veit Moeller, Vinnie Monaco, Vishal Kuo, Vlad
Fomenko, Wayne Chang, Weiyi Zheng, Wenda Zhou, Wesam Manassra, Will
Sheu, Wojciech Zaremba, Yash Patil, Yilei Qian, Yongjik Kim, Youlong Cheng, Yu
Zhang, Yuchen He, Yuchen Zhang, Yujia Jin, Yunxing Dai, and Yury Malkov. 2024.
GPT-4o System Card. arXiv:2410.21276 [cs.CL] https://arxiv.org/abs/2410.21276
[27] Paul Over. 2001. The TREC interactive track: an annotated bibliography. Infor-
mation Processing & Management 37, 3 (2001), 369–381.
[28] Arnold Overwijk, Chenyan Xiong, Xiao Liu, Cameron VandenBerg, and Jamie
Callan. 2022. ClueWeb22: 10 Billion Web Documents with Visual and Semantic
Information. arXiv:2211.15848 [cs.IR]
[29] Paul Owoicho, Jeffrey Dalton, Mohammad Aliannejadi, Leif Azzopardi, Johanne R
Trippas, and Svitlana Vakulenko. 2023. TREC CAsT 2022: Going Beyond User Ask
and System Retrieve with Initiative and Response Generation. In Text REtrieval
Conference (TREC). NIST.
[30] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu:
a Method for Automatic Evaluation of Machine Translation. In Proceedings of
the 40th Annual Meeting of the Association for Computational Linguistics, Pierre
Isabelle, Eugene Charniak, and Dekang Lin (Eds.). Association for Computational
Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083.1073135
[31] Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick
Craswell, and Jimmy Lin. 2024. Initial Nugget Evaluation Results for the TREC 2024
RAG Track with the AutoNuggetizer Framework. arXiv preprint arXiv:2411.09607
(2024).
[32] Filip Radlinski and Nick Craswell. 2017. A theoretical framework for conversa-
tional search. In Proceedings of the 2017 Conference on Conference Human Infor-
mation Interaction and Retrieval. ACM, 117–126.
[33] Hossein A Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell,
Charles LA Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine
Yilmaz. 2024. LLM4Eval: Large Language Model for Evaluation in IR. In Proceedings
of the 47th International ACM SIGIR Conference on Research and Development in
Information Retrieval. 3040–3043.
[34] Hossein A Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell,
Charles LA Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, and Emine
Yilmaz. 2024. Report on the 1st Workshop on Large Language Model for
Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024. arXiv preprint
arXiv:2408.05388 (2024).
[35] David Rau, Hervé Déjean, Nadezhda Chirkova, Thibault Formal, Shuai Wang,
Stéphane Clinchant, and Vassilina Nikoulina. 2024. BERGEN: A Benchmarking
Library for Retrieval-Augmented Generation. In Findings of the Association for
Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and
Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida,
USA, 7640–7663. https://doi.org/10.18653/v1/2024.findings-emnlp.449
[36] Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024.
ARES: An Automated Evaluation Framework for Retrieval-Augmented Genera-
tion Systems. In Proceedings of the 2024 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies
(Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, Kevin
Duh, Helena Gómez-Adorno, and Steven Bethard (Eds.). Association for Computa-
tional Linguistics, 338–354. https://doi.org/10.18653/V1/2024.NAACL-LONG.20
[37] Chris Samarinas, Alexander Krubner, Alireza Salemi, Youngwoo Kim, and Hamed
Zamani. 2025. Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual
Information in Long-form Text Generation. arXiv:2501.03545 [cs.CL] https://arxiv.org/abs/2501.03545
[38] Rikiya Takehi, Akihisa Watanabe, and Tetsuya Sakai. 2023. Open-Domain Dia-
logue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores.
In Proceedings of the Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval in the Asia Pacific Region. 40–45.
[39] Paul Thomas, Seth Spielman, Nick Craswell, and Bhaskar Mitra. 2024. Large
language models can accurately predict searcher preferences. In Proceedings of
the 47th International ACM SIGIR Conference on Research and Development in
Information Retrieval. 1930–1940.
[40] Ellen M. Voorhees. 2003. Overview of the TREC 2003 Question Answering Track.
In Proceedings of The Twelfth Text REtrieval Conference, TREC 2003, Gaithersburg,
Maryland, USA, November 18-21, 2003 (NIST Special Publication, Vol. 500-255),
Ellen M. Voorhees and Lori P. Buckland (Eds.). National Institute of Standards
and Technology (NIST), 54–68. http://trec.nist.gov/pubs/trec12/papers/QA.OVERVIEW.pdf
[41] Grace Hui Yang and Ian Soboroff. 2016. TREC 2016 Dynamic Domain Track
Overview. In Text REtrieval Conference (TREC). NIST.
[42] Grace Hui Yang, Zhiwen Tang, and Ian Soboroff. 2017. TREC 2017 Dynamic
Domain Track Overview. In Text REtrieval Conference (TREC). NIST.
[43] Hui Yang, John R. Frank, and Ian Soboroff. 2015. TREC 2015 Dynamic Domain
Track Overview. In Text REtrieval Conference (TREC). NIST.
[44] Hamed Zamani, Johanne R Trippas, Jeff Dalton, Filip Radlinski, et al. 2023. Con-
versational information seeking. Foundations and Trends® in Information Retrieval
17, 3-4 (2023), 244–456.
[45] Zihan Zhang, Meng Fang, and Ling Chen. 2024. RetrievalQA: Assessing Adap-
tive Retrieval-Augmented Generation for Short-form Open-Domain Question
Answering. In Findings of the Association for Computational Linguistics: ACL
2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for
Computational Linguistics, Bangkok, Thailand, 6963–6975. https://doi.org/10.18653/v1/2024.findings-acl.415
[46] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu,
Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang,
Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-
bench and Chatbot Arena. In Proceedings of the 37th International Conference on
Neural Information Processing Systems (New Orleans, LA, USA) (NIPS ’23). Curran
Associates Inc., Red Hook, NY, USA, Article 2020, 29 pages.
