New-Wiki Eval: An Evolving Wikipedia Multi-metric Evaluation for Large Language Models

Anonymous

17 Apr 2023 · ACL ARR 2023 April Blind Submission · Readers: Everyone
Abstract: The latest large language models (LLMs), such as GPT-3, can generate long articles that are indistinguishable from human-written ones, yet evaluating such text generation remains challenging. Human evaluation of generated articles is expensive and slow, while good automatic evaluation methods are hard to design because of the lack of out-of-sample reference text and the open-ended, creative nature of long text generation. We make the key observation that Wikipedia is constantly evolving and thus provides a high-quality out-of-sample test set for LLMs. In this paper, we therefore propose a new evaluation framework for LLMs' long-text generation. We first prompt the LLMs to perform "Wikipedia generation" and then apply a set of metrics that evaluate the generated articles from multiple perspectives. In practice, we evaluate state-of-the-art LLMs including GPT-3, BLOOM, OPT, GLM, BART, and T5, and show that the results under our framework correlate with findings from prior research.
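The evaluation loop the abstract describes can be sketched in a few lines: sample Wikipedia pages created after a model's training cutoff (so no reference text can appear in its training data), prompt the model to write the article, and score the output against the live human-written page. The sketch below is illustrative only, not the authors' implementation: it assumes a Hugging Face text-generation pipeline (with gpt2 as a stand-in for the evaluated LLMs) and uses ROUGE as a stand-in for the paper's unspecified multi-metric suite; `generate_article` and `score_model` are hypothetical helpers.

```python
# Illustrative sketch of the "Wikipedia generation" evaluation loop.
# Assumptions (not from the paper): gpt2 as the model under test,
# ROUGE as the metric, and a prompt built from the page title.

import evaluate
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
rouge = evaluate.load("rouge")

def generate_article(title: str, max_new_tokens: int = 512) -> str:
    # Prompt the model to continue a Wikipedia-style opening for `title`,
    # then strip the prompt from the returned text.
    prompt = f"{title}\n\nFrom Wikipedia, the free encyclopedia.\n"
    out = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True)
    return out[0]["generated_text"][len(prompt):]

def score_model(test_set: list[dict]) -> dict:
    # Each item pairs the `title` of a post-cutoff Wikipedia page with its
    # human-written `reference` text, giving an out-of-sample reference.
    preds = [generate_article(ex["title"]) for ex in test_set]
    refs = [ex["reference"] for ex in test_set]
    return rouge.compute(predictions=preds, references=refs)
```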
Paper Type: long
Research Area: Resources and Evaluation
