Open-Domain Text Evaluation via Contrastive Distribution Methods

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: metric learning, kernel learning, and sparse coding
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Natural language processing, text evaluation, natural language generation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A general open-domain text evaluation framework that leverages how pre-trained language models' performance improves with parameter count, contrasting the distributions of a stronger and a weaker model to measure text quality.
Abstract: Recent advancements in open-domain text generation, driven by the power of large pre-trained language models (LLMs), have demonstrated remarkable performance. However, assessing these models for specific attributes remains a challenge. Traditional reference-based metrics like BLEU, ROUGE, and METEOR measure the similarity between machine-generated outputs and human-written references, which conflicts with the open-ended nature of these generation tasks and leads to low correlation with human judgments. While trainable discriminator-based evaluation metrics show promise, the acquisition of high-quality training data presents a formidable obstacle. In this paper, we introduce a novel method for evaluating open-domain text generation called Contrastive Distribution Methods (CDM). Leveraging the connection between increasing model parameters and enhanced LLM performance, CDM creates a mapping from the \textit{contrast} of two probabilistic distributions -- one known to be superior to the other -- to quality measures. We investigate CDM for open-domain text generation evaluation under two paradigms: 1) \emph{Generative} CDM, which harnesses the contrast of two language models' distributions to generate synthetic examples for training discriminator-based metrics; 2) \emph{Discriminative} CDM, which directly uses distribution disparities between two language models for evaluation. Our experiments on multi-turn dialogue and factuality in abstractive summarization demonstrate that CDM correlates better with human judgment than existing automatic evaluation metrics on both tasks, highlighting the strong performance and generalizability of our approach.
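To make the Discriminative CDM idea concrete, below is a minimal sketch of a contrast-based scorer: it rates a candidate text by how much more a stronger language model prefers it than a weaker one does. The abstract does not specify the exact scoring function or model pair, so the use of average per-token log-likelihood differences and the "gpt2" / "gpt2-large" models here are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a Discriminative-CDM-style scorer (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_log_likelihood(model, tokenizer, text):
    """Average per-token log-likelihood of `text` under `model`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # `loss` is the mean negative log-likelihood over the sequence.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()

# Assumed model pair: the larger model stands in for the "superior" distribution.
weak = AutoModelForCausalLM.from_pretrained("gpt2")
strong = AutoModelForCausalLM.from_pretrained("gpt2-large")
tok = AutoTokenizer.from_pretrained("gpt2")  # both models share the GPT-2 vocabulary

def cdm_score(text):
    # Higher score: the stronger model prefers the text more than the weaker one,
    # which this sketch uses as a proxy for quality.
    return avg_log_likelihood(strong, tok, text) - avg_log_likelihood(weak, tok, text)

print(cdm_score("The quick brown fox jumps over the lazy dog."))
```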
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6369