Abstract: There is growing interest in systems that generate timeline summaries
by filtering high-volume streams of documents to retain
only those that are relevant to a particular event or topic. Continued
advances in algorithms and techniques for this task depend
on standardized and reproducible evaluation methodologies for
comparing systems. However, timeline summary evaluation is still
in its infancy, with competing methodologies currently being explored
in international evaluation forums such as TREC. One area
of active exploration is how to explicitly represent the units of information
that should appear in a “good” summary. Currently, there
are two main approaches, one based on identifying nuggets in an
external “ground truth”, and the other based on clustering system
outputs. In this paper, by building test collections that have both
nugget and cluster annotations, we are able to compare these two
approaches. Specifically, we address questions related to evaluation
effort, differences in the final evaluation products, and correlations
between scores and rankings generated by both approaches. We
summarize advantages and disadvantages of nuggets and clusters
to offer recommendations for future system evaluations.