Learning Personalized Alignment for Evaluating Open-ended Text Generation

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: With the rapid progress of large language models (LLMs) on language qualities such as fluency and consistency, there has been increasing interest in assessing alignment with diverse human preferences. Traditional metrics rely heavily on lexical similarity with human-written references and have been observed to correlate poorly with human evaluation. Furthermore, they ignore the diverse preferences of individual humans, a key aspect of evaluating open-ended tasks such as story generation. Motivated by these challenges, we introduce PERSE, a personalized evaluation framework that provides interpretable evaluation from an individual's perspective. PERSE first infers a reviewer's specific preferences from a few annotated examples and then measures quality against those preferences. Moreover, it offers an interpretable explanation for its evaluation, such as scores for different aspects. After instruction tuning on 10K examples, our 13B LLaMA-2-based PERSE achieves a 15.8% increase in Kendall correlation and a 13.7% rise in accuracy on zero-shot reviewers compared to GPT-4.
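To make the evaluation protocol in the abstract concrete, the sketch below illustrates its two steps: conditioning on a reviewer's annotated examples to infer their preference, then scoring held-out stories and measuring agreement with Kendall correlation (the metric the abstract reports). This is a minimal sketch, not the paper's implementation: the `llm_score` callable, the prompt template, and the field names (`story`, `review`, `score`) are all assumptions made for illustration; only `scipy.stats.kendalltau` is a real library call.

```python
from scipy.stats import kendalltau  # real SciPy function; returns (tau, p-value)

def build_prompt(preference_examples, candidate_story):
    """Assemble a few-shot prompt: the reviewer's annotated examples first,
    then the new story to be scored under the same inferred preference.
    (Hypothetical template, not the paper's actual format.)"""
    shots = "\n\n".join(
        f"Story: {ex['story']}\nReview: {ex['review']}\nScore: {ex['score']}"
        for ex in preference_examples
    )
    return (
        "Infer this reviewer's preferences from the annotated examples, "
        "then review and score the new story accordingly.\n\n"
        f"{shots}\n\nStory: {candidate_story}\nReview:"
    )

def evaluate_reviewer(llm_score, preference_examples, test_items):
    """Score each held-out story with the model (via the assumed `llm_score`
    callable, which maps a prompt to a numeric score) and report Kendall
    correlation against the reviewer's own scores."""
    predicted = [
        llm_score(build_prompt(preference_examples, item["story"]))
        for item in test_items
    ]
    human = [item["score"] for item in test_items]
    tau, p_value = kendalltau(predicted, human)
    return tau, p_value
```

A higher tau indicates that the model's personalized scores rank stories more consistently with that individual reviewer, which is the sense in which the abstract compares PERSE against GPT-4 on zero-shot reviewers.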
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English