Abstract: Generative large language models (LLMs) have seen many breakthroughs over the last year. With an increasing number of parameters and pre-training data, they have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Strategies employed in this context differ in the choice of input prompts, the selection of samples for demonstration, and the methodology used to construct scores grading the generations. Approaches often differ in the input prompts, the samples that are selected for demonstration and the construction process of scores from the output. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore such approaches for machine translation evaluation and summarization eval- uation. Specifically, we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We test the approaches of the participants on a new reference-free test-set spanning 3 language pairs for machine transla- tion as well as a summarization dataset. Further, we present an overview of the approaches taken by the participants, present their results on the test set and analyze paths for future work. Fi- nally, as a separate track, we perform a human evaluation of the plausibility of explanations given by the LLMs and its effect on model performance. We make parts of our code and datasets available.
Loading