Abstract: Surprisal-based metrics are commonly used to evaluate the quality of natural language generation outputs, especially in open-ended generation tasks. This paper proposes a novel metric that exploits the spectral features of text surprisal; it is an improved version of a recently developed method, Fourier Analysis of Cross-Entropy (FACE), and is hence named FACE-2. The main idea of the metric is inspired by empirical findings on periodicity in human language production. The key improvements in FACE-2 include adding necessary processing steps, a thorough examination of distance functions for measuring spectral similarity, and extended studies on larger models and datasets. Evaluated on open-ended text generation tasks, FACE-2 significantly outperforms its predecessor and a broad set of baseline metrics in revealing the model scaling effect. We also confirm, on a larger human-annotated dataset, FACE's advantage in producing stronger agreement with human preferences compared with other widely used metrics.
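To make the general idea concrete, the sketch below illustrates a surprisal-spectrum metric of the kind the abstract describes: compute per-token surprisal under a language model, take the Fourier magnitude spectrum of the surprisal series, and compare spectra with a distance function. This is a minimal illustration only, not the authors' FACE-2 implementation; the scoring model (GPT-2), the mean-centering step, the interpolation onto a common grid, and the Spearman-based distance are all assumptions made here for demonstration.

```python
# Illustrative sketch only: NOT the authors' FACE-2 implementation.
# Assumes GPT-2 as the scoring model and a Spearman-based spectral distance.
import numpy as np
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def surprisal(text: str) -> np.ndarray:
    """Per-token surprisal (negative log-probability) under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return (-token_log_probs).numpy()

def spectrum(s: np.ndarray) -> np.ndarray:
    """Magnitude spectrum of the mean-centered surprisal series."""
    return np.abs(np.fft.rfft(s - s.mean()))

def spectral_distance(text_a: str, text_b: str) -> float:
    """Toy spectral distance: 1 minus the Spearman correlation of the two
    spectra after interpolating them onto a common frequency grid.
    FACE-2 examines several distance functions; this one is only illustrative."""
    spec_a, spec_b = spectrum(surprisal(text_a)), spectrum(surprisal(text_b))
    n = min(len(spec_a), len(spec_b))
    grid = np.linspace(0.0, 1.0, n)
    a = np.interp(grid, np.linspace(0.0, 1.0, len(spec_a)), spec_a)
    b = np.interp(grid, np.linspace(0.0, 1.0, len(spec_b)), spec_b)
    rho, _ = spearmanr(a, b)
    return 1.0 - rho

# Example: compare a model generation against a human-written reference.
print(spectral_distance("The cat sat quietly on the warm windowsill.",
                        "A dog barked loudly outside the old wooden house."))
```

In a metric of this family, a smaller spectral distance between generated and human text would indicate that the generation reproduces periodicity patterns similar to those found in human language production.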
Paper Type: Long
Research Area: Generation
Research Area Keywords: evaluation methodologies, automatic evaluation of datasets, metrics, computational psycholinguistics
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English, Chinese
Submission Number: 3585