Abstract: Surprisal-based metrics are commonly used to evaluate the quality of natural language generation outputs, especially in open-ended generation tasks. This paper proposes a novel metric that exploits the spectral features of text surprisal; it is an improved version of a recently developed method, Fourier Analysis of Cross-Entropy (FACE), and is hence named FACE-2. The main idea of the metric is inspired by empirical findings on periodicity in human language production. The key improvements in FACE-2 include adding necessary processing steps, a thorough examination of distance functions for measuring spectral similarity, and extended studies on larger models and datasets. Evaluated on open-ended text generation tasks, FACE-2 significantly outperforms its predecessor and a broad set of baseline metrics in revealing the model scaling effect. We also confirm, on a larger human-annotated dataset, FACE's advantage in producing stronger agreement with human preferences compared with other widely used metrics.
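To make the general idea concrete, the sketch below illustrates a surprisal-spectrum metric of the kind the abstract describes: compute per-token surprisal under a language model, take the Fourier magnitude spectrum of the surprisal series, and compare spectra with a distance function. This is a minimal illustration only, not the authors' FACE-2 implementation; the scoring model (GPT-2), the mean-centering step, the interpolation onto a common grid, and the Spearman-based distance are all assumptions made here for demonstration.

```python
# Illustrative sketch only: NOT the authors' FACE-2 implementation.
# Assumes GPT-2 as the scoring model and a Spearman-based spectral distance.
import numpy as np
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def surprisal(text: str) -> np.ndarray:
    """Per-token surprisal (negative log-probability) under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return (-token_log_probs).numpy()

def spectrum(s: np.ndarray) -> np.ndarray:
    """Magnitude spectrum of the mean-centered surprisal series."""
    return np.abs(np.fft.rfft(s - s.mean()))

def spectral_distance(text_a: str, text_b: str) -> float:
    """Toy spectral distance: 1 minus the Spearman correlation of the two
    spectra after interpolating them onto a common frequency grid.
    FACE-2 examines several distance functions; this one is only illustrative."""
    spec_a, spec_b = spectrum(surprisal(text_a)), spectrum(surprisal(text_b))
    n = min(len(spec_a), len(spec_b))
    grid = np.linspace(0.0, 1.0, n)
    a = np.interp(grid, np.linspace(0.0, 1.0, len(spec_a)), spec_a)
    b = np.interp(grid, np.linspace(0.0, 1.0, len(spec_b)), spec_b)
    rho, _ = spearmanr(a, b)
    return 1.0 - rho

# Example: compare a model generation against a human-written reference.
print(spectral_distance("The cat sat quietly on the warm windowsill.",
                        "A dog barked loudly outside the old wooden house."))
```

In a metric of this family, a smaller spectral distance between generated and human text would indicate that the generation reproduces periodicity patterns similar to those found in human language production.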
Paper Type: Long
Research Area: Generation
Research Area Keywords: evaluation methodologies, automatic evaluation of datasets, metrics, computational psycholinguistics
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English, Chinese
Submission Number: 3585