% DO NOT EDIT - automatically generated from metadata.yaml

\def \codeURL{https://github.com/tricodex/Reproducing_DialSummEval}
\def \codeDOI{}
\def \codeSWH{swh:1:dir:845557b246b9705efe933ab6deade75b4496a071}
\def \dataURL{}
\def \dataDOI{}
\def \editorNAME{}
\def \editorORCID{}
\def \reviewerINAME{}
\def \reviewerIORCID{}
\def \reviewerIINAME{}
\def \reviewerIIORCID{}
\def \dateRECEIVED{03 February 2023}
\def \dateACCEPTED{}
\def \datePUBLISHED{}
\def \articleTITLE{[Re] DialSummEval - Evaluation of automatic summarization evaluation metrics}
\def \articleTYPE{Replication}
\def \articleDOMAIN{ML Reproducibility Challenge 2022}
\def \articleBIBLIOGRAPHY{bibliography.bib}
\def \articleYEAR{2023}
\def \reviewURL{https://openreview.net/forum?id=3jaZ5tKRyiT&noteId=AHjI9Jw6frY}
\def \articleABSTRACT{Scope of Reproducibility — In this paper, we perform a reproduction study of the original work of Gao and Wan on the evaluation of automatic dialogue summarization metrics and models. They concluded that (1) few metrics are efficient across dimensions, (2) metrics perform differently in the dialogue summarization task than when evaluating conventional summarization, (3) models tailored for dialogue summarization capture coherence and fluency better than consistency and relevance. Methodology — Three annotators evaluated the outputs of 13 summarization models and their human reference summaries, following the guidelines of the original paper. This took on average 20 hours. A new annotation tool was developed to address the limitations of the Excel interface. An ablation study was conducted with a subset of data annotated with the original process. Finally, we used modified parts of the authors’ code to apply the metrics to the summaries and compare their scores with our human judgments. All experiments were run on CPU. Results — The original paper’s main claims were reproduced. While not all of the original authors’ findings were replicated (e.g. ROUGE scoring higher for relevance), the correlation between metrics and human judgments showed similar tendencies to those in the original paper. Our annotations correlated with the original annotations at a Pearson score of 0.6, sufficient for reproducing the main claims. What was easy — The reproducibility strengths of the original paper lie primarily in its thorough methodological description. The rich and detailed tables made the comparison with our reproduced results fairly easy. What was difficult — The original paper’s code was relatively complex to navigate and required a fair amount of debugging when running the metrics. Certain deficiencies in the annotation guidelines also resulted in rather time-consuming decision-making for the annotators. 
Finally, the description of the annotation post-processing was unclear, and the code for calculating the inter-annotator agreement was missing. Communication with original authors — We contacted the paper’s first author twice to request the annotation guidelines, the missing code parts, and clarifications regarding the annotation post-processing. Their responses were prompt and helpful.}
\def \replicationCITE{M. Gao and X. Wan. 'DialSummEval: Revisiting Summarization Evaluation for Dialogues.' In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022, pp. 5693–5709}
\def \replicationBIB{gao2022dialsummeval}
\def \replicationURL{https://aclanthology.org/2022.naacl-main.418/}
\def \replicationDOI{10.18653/v1/2022.naacl-main.418}
\def \contactNAME{Lea Krause}
\def \contactEMAIL{l.krause@vu.nl}
\def \articleKEYWORDS{rescience c, rescience x, machine learning, summarisation, evaluation, metrics, human annotation}
\def \journalNAME{ReScience C}
\def \journalVOLUME{9}
\def \journalISSUE{2}
\def \articleNUMBER{1}
\def \articleDOI{}
\def \authorsFULL{Patrick Camara et al.}
\def \authorsABBRV{P. Camara et al.}
\def \authorsSHORT{Camara et al.}
\title{\articleTITLE}
\date{}
\author[1,2,\orcid{0009-0005-7069-3337}]{Patrick Camara}
\author[1,2,\orcid{0009-0006-1199-599X}]{Mojca Kloos}
\author[1,2,\orcid{0009-0007-2366-4722}]{Vasiliki Kyrmanidi}
\author[1,2,\orcid{0009-0004-3876-9285}]{Agnieszka Kluska}
\author[1,2,\orcid{0009-0005-4224-0453}]{Rorick Terlou}
\author[1,\orcid{0009-0001-7187-5224}]{Lea Krause}
\affil[1]{Vrije Universiteit Amsterdam, Amsterdam, The Netherlands}
\affil[2]{Equal contributions}
