Abstract: In text generation, we aim to produce outputs that are not only
correct but also diverse in content, word choice, and meaning. The ability
to generate accurate and diverse text is crucial in conversation systems,
story generation, machine translation, paraphrasing, commonsense reasoning,
and other tasks. To evaluate generated text efficiently, researchers have
extensively studied automatic evaluation metrics as substitutes for
expensive, slow human evaluation.
Existing metrics include $n$-gram-based metrics and neural-based
metrics. The former perform well at measuring form or lexical
quality and diversity, while the latter excel at capturing semantic
quality and diversity; both show good correlation with human
judgments. In this work, we observe a trade-off between semantic quality and diversity in the outputs of models trained
for multi-reference text generation, which makes it hard to identify the best model by examining quality and diversity metrics separately.
We propose a human study framework and provide methods for
generating experimental data that researchers can use to design or
evaluate new metrics in the future.
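
As a minimal illustration of the $n$-gram-based metric family mentioned above, the sketch below computes distinct-$n$, a common lexical-diversity measure (ratio of unique to total $n$-grams across a model's outputs). The function name and the toy sentences are our own illustrative choices and are not taken from the paper.

```python
def distinct_n(outputs, n=2):
    """Ratio of unique n-grams to total n-grams over a set of generated outputs.

    Higher values indicate more lexically diverse generations.
    Assumes `outputs` is a list of whitespace-tokenizable strings (toy setting).
    """
    total = 0
    unique = set()
    for text in outputs:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Toy usage: two sets of generations for the same prompt.
bland = ["the cat sat on the mat", "the cat sat on the rug"]
varied = ["a tabby dozed on the mat", "the kitten curled up by the fire"]
print(distinct_n(bland, n=2), distinct_n(varied, n=2))  # e.g. 0.6 vs 1.0
```

Neural-based metrics, by contrast, typically score semantic similarity between a generation and its references with a pretrained encoder rather than surface $n$-gram overlap.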