(Mostly) Automatic Experiment Execution for Human Evaluations of NLP Systems

Published: 01 Jan 2024 · Last Modified: 18 May 2025 · INLG 2024 · License: CC BY-SA 4.0
Abstract: Human evaluation is widely considered the most reliable form of evaluation in NLP, but recent research has shown it to be riddled with mistakes, often resulting from the manual execution of tasks. This paper argues that such mistakes could be avoided if we were to automate, as much as is practical, the process of performing experiments for human evaluation of NLP systems. We provide a simple methodology that can improve both the transparency and the reproducibility of experiments. We show how the sequence of component processes of a human evaluation can be defined in advance, facilitating full or partial automation, detailed preregistration of the process, and research transparency and repeatability.
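To make the core idea concrete, the sketch below shows one way the sequence of component processes of a human evaluation might be declared in advance so that it can be preregistered, logged, and executed automatically or semi-automatically. This is a minimal illustration under assumed step names (selecting outputs, assigning evaluators, collecting ratings); it is not the paper's actual implementation.

```python
# Minimal sketch (illustrative only, not the paper's implementation): an experiment
# declared as an ordered list of named steps, so the full sequence can be
# preregistered and then run or re-run with an execution log for transparency.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]  # takes the experiment state, returns the updated state


@dataclass
class Experiment:
    steps: list[Step]
    state: dict = field(default_factory=dict)
    log: list[str] = field(default_factory=list)

    def execute(self) -> dict:
        # Run every step in the preregistered order, recording each completion.
        for step in self.steps:
            self.state = step.run(self.state)
            self.log.append(f"completed: {step.name}")
        return self.state


# Hypothetical component processes, named here only for illustration.
def select_outputs(state: dict) -> dict:
    # Sample the system outputs to be judged.
    state["items"] = ["output_1", "output_2"]
    return state


def assign_evaluators(state: dict) -> dict:
    # Allocate items to human evaluators.
    state["assignments"] = {e: state["items"] for e in ["eval_A", "eval_B"]}
    return state


def collect_ratings(state: dict) -> dict:
    # In a real experiment this step would pause to gather human judgements.
    state["ratings"] = {}
    return state


experiment = Experiment(steps=[
    Step("select_outputs", select_outputs),
    Step("assign_evaluators", assign_evaluators),
    Step("collect_ratings", collect_ratings),
])
experiment.execute()
print(experiment.log)
```

Because the steps are data rather than ad hoc manual actions, the same declaration can serve as a preregistration document and as the script that executes (or partially executes) the experiment.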