Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP

Published: 01 Jan 2023 · Last Modified: 18 Jun 2024 · ACL (Findings) 2023 · License: CC BY-SA 4.0
Abstract: Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried, let alone formally tested, in NLP, which means that their repeatability and the reproducibility of their results is currently an open question. This focused contribution reports our review of human evaluation experiments reported in NLP papers over the past five years, which we assessed in terms of their ability to be rerun. Overall, we estimate that just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goes up to about 20% when author help is sought. We complement this investigation with a survey of results concerning the reproducibility of human evaluations where those are repeatable in the first place. Here we find worryingly low degrees of reproducibility, both in terms of similarity of scores and of findings supported by them. We summarise what insights can be gleaned so far regarding how to make human evaluations in NLP more repeatable and more reproducible.
