Keywords: Reproduction, Evaluation, Dialogue Summarization
TL;DR: A reproduction study of DialSummEval, an evaluation of automatic metrics for dialogue summarization
Abstract: Scope of Reproducibility: In this paper, we perform a reproduction study of the original work of Gao and Wan (2022) on the evaluation of automatic dialogue summarization metrics and models. They concluded that (1) few metrics are efficient across dimensions, (2) metrics perform differently in the dialogue summarization task than when evaluating conventional summarization, and (3) models tailored for dialogue summarization capture coherence and fluency better than consistency and relevance. Methodology: Three annotators evaluated the outputs of 13 summarization models and the corresponding human reference summaries, following the guidelines of the original paper. This took 20 hours on average. A new annotation tool was developed to address the limitations of the Excel interface. An ablation study was conducted on a subset of the data annotated with the original process. Finally, we implemented modified parts of the authors' code to apply the metrics to the summaries and compare their scores with our human judgments. All experiments were run on CPU. Results: The original paper's main claims were reproduced. While not all of the original authors' arguments were replicated (e.g. ROUGE scoring higher for relevance), the correlation between metrics and human judgments showed similar tendencies to those in the original paper. Our annotations correlated with the original ones with a Pearson coefficient of 0.6, which is sufficient for reproducing the main claims. What was easy: The reproducibility strengths of the original paper lie primarily in its thorough methodological description. The rich and detailed tables made the comparison with our reproduced results fairly easy. What was difficult: The original paper's code was relatively complex to navigate and required a fair amount of debugging when running the metrics. Certain deficiencies in the annotation guidelines also resulted in rather time-consuming decision-making for the annotators. Finally, the methodological description of the post-processing of the annotations was relatively unclear, and the code calculating the inter-annotator agreement was missing. Communication with original authors: We contacted the paper's first author twice to request the annotation guidelines, the missing code parts, and clarifications regarding the annotation post-processing. Their responses were prompt and helpful.
Paper Url: https://aclanthology.org/2022.naacl-main.418/
Paper Venue: Other venue (not in list)
Venue Name: NAACL 2022
Confirmation: The report pdf is generated from the provided camera ready Google Colab script, The report metadata is verified from the camera ready Google Colab script, The report contains correct author information., The report contains link to code and SWH metadata., The report follows the ReScience latex style guides as in the Reproducibility Report Template (https://paperswithcode.com/rc2022/registration)., The report contains the Reproducibility Summary in the first page., The latex .zip file is verified from the camera ready Google Colab script
Journal: ReScience Volume 9 Issue 2 Article 14