ExSiM: Towards Explainable Automated Evaluation of NLG Systems

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Automated evaluation of Natural Language Generation (NLG) systems is hard. The common practice for evaluating NLG systems is to compute the similarity between a collection of automatically generated documents and their corresponding (human-written) gold-standard reference documents. Unfortunately, existing document similarity metrics are black boxes and thus hard to interpret and explain, making robust evaluation of NLG systems even more challenging. To address this issue, this paper introduces a new evaluation metric called ExSiM that produces a vector of scores instead of a single similarity score, where each component of the vector captures a particular aspect of the similarity, thus providing a natural form of explanation. Our experimental results on Wikipedia article triplets and a custom-built narrative dataset demonstrate that the proposed ExSiM vector performs comparably to traditional metrics such as BERTScore and ROUGE on undirected similarity assessment, while providing useful explanations and achieving higher human-machine agreement on directed similarity assessment.
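As a rough illustration of the vector-of-scores idea described in the abstract (a minimal sketch, not the paper's actual ExSiM definition, which the abstract does not spell out), the following Python snippet returns named, individually interpretable similarity components instead of a single scalar. The two components shown, lexical_overlap and length_ratio, are hypothetical placeholders.

from collections import Counter
from dataclasses import dataclass

@dataclass
class SimilarityVector:
    """Vector-valued similarity: each field explains one aspect of the match."""
    lexical_overlap: float  # unigram F1 between the two token multisets
    length_ratio: float     # shorter document length / longer document length

def explainable_similarity(generated: str, reference: str) -> SimilarityVector:
    gen = generated.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts tokens that appear in both documents.
    shared = sum((Counter(gen) & Counter(ref)).values())
    precision = shared / len(gen) if gen else 0.0
    recall = shared / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    ratio = min(len(gen), len(ref)) / max(len(gen), len(ref)) if gen and ref else 0.0
    return SimilarityVector(lexical_overlap=f1, length_ratio=ratio)

print(explainable_similarity("the cat sat on the mat", "a cat sat on a mat"))

Because each field is named, a low overall similarity can be traced to a specific component (e.g., a poor length_ratio), which is the kind of explanation a single BERTScore or ROUGE number cannot provide.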
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English