PerSEval: Assessing Personalization in Text Summarizers

TMLR Paper2691 Authors

14 May 2024 (modified: 26 Jul 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: Personalized summarization models cater to individuals' subjective understanding of saliency, as represented by their reading history and current topics of attention. Existing personalized text summarizers are primarily evaluated with accuracy measures such as BLEU, ROUGE, and METEOR. However, a recent study argued that accuracy measures are inadequate for evaluating the $\textit{degree of personalization}$ of these models and proposed EGISES, the first metric to evaluate personalized text summaries. It was suggested that accuracy is a separate aspect and should be evaluated standalone. In this paper, we challenge the necessity of an accuracy leaderboard, suggesting that relying on accuracy-based aggregated results might lead to misleading conclusions. To support this, we delve deeper into EGISES, demonstrating both theoretically and empirically that it measures the $\textit{degree of responsiveness}$, a necessary but not sufficient condition for degree-of-personalization. We subsequently propose PerSEval, a novel measure that satisfies the required sufficiency condition. Based on the benchmarking of ten SOTA summarization models on the PENS dataset, we empirically establish that -- (i) PerSEval is reliable w.r.t. human-judgment correlation (Pearson's $r$ = 0.73; Spearman's $\rho$ = 0.62; Kendall's $\tau$ = 0.42), (ii) PerSEval has high rank-stability, (iii) PerSEval as a rank-measure is not entailed by EGISES-based ranking, and (iv) PerSEval can be a standalone rank-measure without the need for any aggregated ranking.
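The reliability claim above rests on rank correlations between metric scores and human judgments. As a generic illustration of how such correlations are computed (this is not the paper's code, and the score values below are invented for the example), here is a minimal pure-Python sketch of Spearman's $\rho$ and Kendall's $\tau$ for two lists of model scores, assuming no tied scores:

```python
def ranks(scores):
    # Rank 1 = highest score; this sketch does not handle ties.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i = rank difference
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    # tau = (concordant - discordant) / total pairs
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical human-judgment scores vs. metric scores for five models.
human  = [0.9, 0.7, 0.5, 0.3, 0.1]
metric = [0.8, 0.75, 0.4, 0.45, 0.05]
print(spearman_rho(human, metric))  # 0.9
print(kendall_tau(human, metric))   # 0.8
```

A single swapped pair (models 3 and 4) lowers both coefficients below 1.0, with $\tau$ penalized more heavily because it counts every discordant pair directly.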
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=JZc74bUUCp
Changes Since Last Submission: Dear Action Editors and Reviewers, we thank the reviewers for their time and valuable feedback, which has helped us improve the quality of the manuscript. The following suggestions have been addressed; please refer to the attached supplementary material.

$\textbf{Review of Paper2691 by Reviewer hE9R}$

$\textbf{Summary of Suggested Revisions:}$

1. While this paper seems generally sound, the research scope is somewhat narrow. As the authors have already argued that they found no dataset containing user behavior history, such as user-click timestamp records, other than the news dataset, the generality could be limited. ....

$\textbf{Review of Paper2691 by Reviewer bX5T}$

$\textbf{Summary of Suggested Revisions:}$

1. Please add a distinct discussion of the difference between P-Acc and EGISES. Optimally, this should be worked into the introduction and the main argumentation of the paper as well, since P-Acc is the main piece of related work here. Does P-Acc potentially not solve the problem, so that a better metric is needed? Can we still make the triangulation proof on the basis of P-Acc? Is the ranking of P-Acc inferior to PerSEval? Until this is clarified, I would not agree that the claims and evidence are clear.

2. Would it be possible to infer more information from the human-judgment experiment you conducted that would allow further comparison of EGISES / P-Acc / PerSEval? If so, please add such information, or provide argumentation for why this is not possible or not needed.

3. Why are other variants of your system not as competitive as PerSEval-InfoLM-$\alpha\beta$? Please add more discussion of your results, helping the reader make informed design decisions.

4. Please add a discussion of the boundaries of such a crafted metric relative to a learned metric, which cannot be learned without feedback data. To what extent does the defined personalization property hold?

5. The presentation of the paper does not include sufficient examples of the accuracy measure to support the claimed difference from personalization. Maybe this is too basic, but I am missing a better example of what accuracy means for a summary. ....

$\textbf{Review of Paper2691 by Reviewer muUC}$

$\textbf{Summary of Suggested Revisions:}$

1. The example on page 3 (Alice) is unclear and confusing.

2. Results on other datasets are needed to demonstrate reliability; I am not convinced of its usefulness.

3. Discuss the computational implications of using PerSEval, such as the additional resources required due to the complexity of the method.
Assigned Action Editor: ~Manzil_Zaheer1
Submission Number: 2691