Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing

Anonymous

16 Oct 2023 | ACL ARR 2023 October Blind Submission | Readers: Everyone
Abstract: Most research on natural language generation (NLG) relies on evaluation benchmarks with a limited number of references per sample, which may result in poor correlations with human judgements. The underlying reason is that a single semantic meaning can be expressed in many different forms, so evaluation against only one or a few references may not accurately reflect the quality of the model's hypotheses. To address this issue, this paper presents a simple and effective method, named Para-Ref, to enhance existing evaluation benchmarks by enriching the number of references. We leverage large language models (LLMs) to paraphrase a single reference into multiple high-quality ones with diverse expressions. Experimental results on the representative NLG tasks of machine translation, text summarization, and image captioning demonstrate that our method effectively improves the correlation with human evaluation for seventeen automatic evaluation metrics. Metrics ranging from the word-based BLEU to the LLM-based GEMBA all benefit from our Para-Ref method. We strongly encourage future generation benchmarks to include more references, even if they are paraphrased using LLMs, since this is a one-time effort.
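A minimal sketch of the Para-Ref idea in Python, assuming the sacrebleu library for multi-reference BLEU; the `llm_paraphrase` function, its prompt-free signature, and the way scores are aggregated are illustrative placeholders, not the authors' released implementation.

```python
from typing import List
import sacrebleu


def llm_paraphrase(reference: str, n: int) -> List[str]:
    """Placeholder: call any LLM of your choice to produce n diverse,
    meaning-preserving paraphrases of `reference`."""
    raise NotImplementedError("Plug in your preferred LLM API here.")


def para_ref_bleu(hypothesis: str, reference: str, n_paraphrases: int = 5) -> float:
    # Enrich the single gold reference with LLM-generated paraphrases ...
    references = [reference] + llm_paraphrase(reference, n_paraphrases)
    # ... then score the hypothesis against the enlarged reference set.
    return sacrebleu.sentence_bleu(hypothesis, references).score
```

The same enriched reference set can be passed to any metric that natively supports multiple references; for strictly single-reference metrics, one could score the hypothesis against each paraphrase separately and aggregate (e.g. take the maximum).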
Paper Type: long
Research Area: Generation
Contribution Types: NLP engineering experiment, Data resources, Position papers
Languages Studied: English, Chinese, German, Russian
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.