Abstract: In text generation, we aim to produce outputs that are not only
correct but also diverse in content, word choice, and meaning. The ability
to generate accurate and diverse text is crucial in conversation systems,
story generation, machine translation, paraphrasing, commonsense reasoning,
and other tasks. To evaluate generated text efficiently, researchers have
extensively studied automatic evaluation metrics as substitutes for
expensive, slow human evaluation.
Existing metrics include $n$-gram-based metrics and neural-based
metrics. The former perform well at measuring form or lexical
quality and diversity, while the latter excel at capturing semantic
quality and diversity; both show good correlation with human
judgments. In this work, we observe a trade-off between semantic quality and diversity in the outputs of models trained
for multi-reference text generation, which makes it hard to identify the best model by examining quality and diversity metrics separately.
We propose a human study framework and provide methods for
generating experimental data that researchers can use to design or
evaluate new metrics in the future.
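
As a minimal illustration of the $n$-gram-based metric family mentioned above, the sketch below computes distinct-$n$, a common lexical-diversity measure (ratio of unique to total $n$-grams across a model's outputs). The function name and the toy sentences are our own illustrative choices and are not taken from the paper.

```python
def distinct_n(outputs, n=2):
    """Ratio of unique n-grams to total n-grams over a set of generated outputs.

    Higher values indicate more lexically diverse generations.
    Assumes `outputs` is a list of whitespace-tokenizable strings (toy setting).
    """
    total = 0
    unique = set()
    for text in outputs:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Toy usage: two sets of generations for the same prompt.
bland = ["the cat sat on the mat", "the cat sat on the rug"]
varied = ["a tabby dozed on the mat", "the kitten curled up by the fire"]
print(distinct_n(bland, n=2), distinct_n(varied, n=2))  # e.g. 0.6 vs 1.0
```

Neural-based metrics, by contrast, typically score semantic similarity between a generation and its references with a pretrained encoder rather than surface $n$-gram overlap.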