Abstract: Capturing the essence of a collection of documents through short textual descriptions of its salient themes is a common and useful practice. However, evaluating such sets of descriptions relies heavily on slow, laborious, and subjective human annotation procedures. To address this, we introduce TDSetScore, an automatic, reference-less methodology for evaluating sets of theme-representing descriptions. TDSetScore decomposes the evaluation into three annotation tasks that define five scores along different quality aspects. This framing simplifies and expedites the manual evaluation process and enables automatic, independent LLM-based evaluation. As a test case, we apply our approach to a corpus of Holocaust survivor testimonies, motivated both by its relevance to the task and by the moral significance of this pursuit. We validate the methodology by experimenting with natural and synthetic generation systems and comparing their performance using the proposed scores.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, automatic creation and evaluation of language resources, NLP datasets, automatic evaluation of datasets, evaluation methodologies, evaluation, metrics, statistical testing for evaluation
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 162