ExSiM: Towards Explainable Automated Evaluation of NLG Systems

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Automated evaluation of Natural Language Generation (NLG) systems is hard. The common practice for evaluating NLG systems is to compute the similarity between a collection of automatically generated documents and their corresponding (human-written) gold-standard reference documents. Unfortunately, existing document similarity metrics are black boxes and thus hard to interpret and explain, making robust evaluation of NLG systems even more challenging. To address this issue, this paper introduces a new evaluation metric called ExSiM that produces a vector of scores instead of a single similarity score, where each component of the vector captures a particular aspect of the similarity, thus providing a natural form of explanation. Our experimental results on Wikipedia article triplets and a custom-built narrative dataset demonstrate that the proposed ExSiM vector performs comparably to traditional metrics such as BERTScore and ROUGE on undirected similarity assessment, while providing useful explanations and achieving higher human-machine agreement on directed similarity assessment.
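As a rough illustration of the vector-of-scores idea described in the abstract (a minimal sketch, not the paper's actual ExSiM definition, which the abstract does not spell out), the following Python snippet returns named, individually interpretable similarity components instead of a single scalar. The two components shown, lexical_overlap and length_ratio, are hypothetical placeholders.

from collections import Counter
from dataclasses import dataclass

@dataclass
class SimilarityVector:
    """Vector-valued similarity: each field explains one aspect of the match."""
    lexical_overlap: float  # unigram F1 between the two token multisets
    length_ratio: float     # shorter document length / longer document length

def explainable_similarity(generated: str, reference: str) -> SimilarityVector:
    gen = generated.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts tokens that appear in both documents.
    shared = sum((Counter(gen) & Counter(ref)).values())
    precision = shared / len(gen) if gen else 0.0
    recall = shared / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    ratio = min(len(gen), len(ref)) / max(len(gen), len(ref)) if gen and ref else 0.0
    return SimilarityVector(lexical_overlap=f1, length_ratio=ratio)

print(explainable_similarity("the cat sat on the mat", "a cat sat on a mat"))

Because each field is named, a low overall similarity can be traced to a specific component (e.g., a poor length_ratio), which is the kind of explanation a single BERTScore or ROUGE number cannot provide.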
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English