Evaluating Open-ended Text Generation of Large Language Models using Spectral Distances of Surprisal

ACL ARR 2025 May Submission5432 Authors

20 May 2025 (modified: 29 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0

Abstract: We propose FACE-2, a novel automatic evaluation metric for open-ended text generation that substantially improves on the recently developed Fourier analysis of cross-entropy (FACE) method. FACE-2 is a psycholinguistically inspired metric that extracts the dynamic patterns (spectrum) of text surprisal. On open-ended text generation tasks, FACE-2 significantly outperforms a broad set of baseline metrics in revealing the model scaling effect, and it continues to do so for models of up to 70B parameters, whereas many existing metrics fail to capture this effect. We also confirm that FACE-2 agrees more strongly with human preferences on a large human-annotated dataset. We advocate including metrics that mine the dynamics of likelihood when evaluating open-ended text generation, since they cover broader aspects of human language than static likelihood-based or semantic-based metrics alone.
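To make the idea of a "spectrum of surprisal" concrete, the following is a minimal sketch of one way to compare two per-token surprisal sequences in the frequency domain. It is not the FACE-2 implementation: the function names (`power_spectrum`, `resample`, `spectral_distance`), the periodogram estimate, the shared 32-point frequency grid, and the Euclidean distance are all assumptions chosen for illustration, and the surprisal values are assumed to have been computed beforehand by some language model.

```python
import cmath
import math


def power_spectrum(surprisals):
    """Periodogram of a mean-centered surprisal sequence (naive O(n^2) DFT)."""
    n = len(surprisals)
    mean = sum(surprisals) / n
    centered = [s - mean for s in surprisals]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        freqs.append(k / n)  # normalized frequency in cycles per token
        power.append(abs(coeff) ** 2)
    return freqs, power


def resample(freqs, power, grid):
    """Linearly interpolate a spectrum onto a shared frequency grid."""
    out = []
    for g in grid:
        if g <= freqs[0]:
            out.append(power[0])
        elif g >= freqs[-1]:
            out.append(power[-1])
        else:
            for i in range(1, len(freqs)):
                if g <= freqs[i]:
                    w = (g - freqs[i - 1]) / (freqs[i] - freqs[i - 1])
                    out.append(power[i - 1] * (1 - w) + power[i] * w)
                    break
    return out


def spectral_distance(surprisals_a, surprisals_b, n_points=32):
    """Euclidean distance between two normalized surprisal spectra.

    Hypothetical illustration only; the paper's actual distance may differ.
    """
    grid = [0.5 * i / (n_points - 1) for i in range(n_points)]
    spectra = []
    for seq in (surprisals_a, surprisals_b):
        p = resample(*power_spectrum(seq), grid)
        total = sum(p)
        # Normalize to unit mass so sequences of different length are comparable.
        spectra.append([v / total for v in p] if total > 0 else p)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(*spectra)))


# Toy surprisal sequences (made-up numbers, in bits per token).
fast_rhythm = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0]  # period of 2 tokens
slow_rhythm = [1.0, 1.0, 3.0, 3.0, 1.0, 1.0, 3.0, 3.0]  # period of 4 tokens
print(spectral_distance(fast_rhythm, slow_rhythm))
```

The two toy sequences have the same mean surprisal, so a static likelihood-based score would not separate them, while their spectra peak at different frequencies. This is the kind of dynamic information the abstract argues a spectral metric can capture.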

Paper Type: Long

Research Area: Generation

Research Area Keywords: Evaluation Methodologies, Automatic Evaluation of Datasets, Metrics, Computational Psycholinguistics

Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models

Languages Studied: English, Chinese

Submission Number: 5432
