Accuracy is not enough: Evaluating Personalization in Summarizers

Published: 07 Oct 2023, Last Modified: 01 Dec 2023 · EMNLP 2023 Findings
Submission Type: Regular Long Paper
Submission Track: Summarization
Submission Track 2: Resources and Evaluation
Keywords: Personalized Summarization Evaluation, Meta Evaluation, Automated Accuracy Metrics
TL;DR: We demonstrate that degree of personalization is a characteristic independent of accuracy, and hence propose a novel measure, called $EGISES$, for computing the degree of personalization of a summarization model.
Abstract: Text summarization models are evaluated for accuracy and quality using measures such as ROUGE, BLEU, METEOR, BERTScore, PYRAMID, readability, and several other recently proposed ones. The central objective of all accuracy measures is to evaluate a model's ability to capture $\textit{saliency}$ accurately. Since saliency is subjective w.r.t. readers' preferences, there cannot be a one-size-fits-all summary for a given document, which means that in many use-cases summarization models need to be personalized w.r.t. user profiles. However, to our knowledge, there is no measure for evaluating the $\textit{degree-of-personalization}$ of a summarization model. In this paper, we first establish that existing accuracy measures cannot evaluate the degree of personalization of any summarization model, and then propose a novel measure, called $EGISES$, for automatically computing it. Using the PENS dataset released by Microsoft Research, we analyze the degree of personalization of ten state-of-the-art summarization models (both extractive and abstractive), five of which are explicitly trained for personalized summarization, while the remaining five are adapted to exhibit personalization. We conclude by proposing $P$-$Accuracy$, a generalized framework for designing accuracy measures that also take personalization into account, and demonstrate its robustness and reliability through meta-evaluation.
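The abstract's core intuition is that personalization is about responsiveness: a model is personalized to the degree that differences between its summaries for different users track differences between those users' profiles, regardless of how accurate each summary is. The exact $EGISES$ formulation is given in the paper; the sketch below only illustrates this intuition with a toy divergence-based proxy, where the function names, the Jaccard distance choice, and the example data are all assumptions, not the paper's method.

```python
# Illustrative sketch (NOT the paper's EGISES formula): score how well
# pairwise distances between a model's per-user summaries align with
# pairwise distances between the corresponding user profiles.
# All names, the Jaccard proxy, and the toy data are assumptions.

def jaccard_distance(a: str, b: str) -> float:
    """Token-set distance between two texts (toy proxy for divergence)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def degree_of_personalization(summaries_by_user: dict, profiles_by_user: dict) -> float:
    """1 minus the mean absolute gap between pairwise summary distances
    and pairwise profile distances; 1.0 means summary differences track
    profile differences perfectly, 0.0 means they ignore them entirely."""
    users = sorted(summaries_by_user)
    gaps = []
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            d_sum = jaccard_distance(summaries_by_user[u], summaries_by_user[v])
            d_prof = jaccard_distance(profiles_by_user[u], profiles_by_user[v])
            gaps.append(abs(d_sum - d_prof))
    return 1.0 - sum(gaps) / len(gaps)

profiles = {"u1": "sports football", "u2": "finance markets"}

# A generic model emits one summary for everyone: its output distances
# stay 0 however far apart the profiles are, so the score is low even
# if each summary is individually accurate.
generic = {"u1": "team wins match", "u2": "team wins match"}

# A personalized model varies its output with the profile.
personal = {"u1": "team wins football match", "u2": "stocks rally on earnings"}

print(degree_of_personalization(generic, profiles))   # 0.0
print(degree_of_personalization(personal, profiles))  # 1.0
```

The generic model scores 0.0 here despite producing a plausible summary, which mirrors the abstract's claim that accuracy measures alone cannot detect the absence of personalization.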
Submission Number: 1985