How to Do Human Evaluation: Best Practices for User Studies in NLP

Anonymous

17 Jul 2021 (modified: 05 May 2023) · ACL ARR 2021 July Blind Submission
Abstract: Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Understanding how to design user studies therefore becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard for researchers without prior experience in human evaluation to get started. In this paper, we summarize the most important aspects of user study design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, and (ii) discuss the particularities of user studies in NLP, providing starting points for selecting questionnaires, experimental designs, and evaluation methods tailored to specific NLP tasks. Throughout, we offer examples with accompanying statistical evaluation code in R to bridge the gap between theoretical guidelines and practical application.
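To give a flavor of the kind of statistical evaluation the abstract refers to, the following is a minimal sketch in R (not taken from the paper's supplementary material): it compares per-item human ratings of two hypothetical NLP systems with a paired Wilcoxon signed-rank test, a common choice for ordinal Likert-style ratings. All names and data below are fabricated for illustration.

# Hypothetical paired design: the same 30 items rated 1-5 for two systems.
set.seed(42)
ratings_a <- sample(1:5, 30, replace = TRUE, prob = c(0.05, 0.10, 0.25, 0.35, 0.25))
ratings_b <- sample(1:5, 30, replace = TRUE, prob = c(0.10, 0.25, 0.30, 0.25, 0.10))

# Paired Wilcoxon signed-rank test; exact = FALSE because discrete
# ratings produce ties, which rule out the exact null distribution.
result <- wilcox.test(ratings_a, ratings_b, paired = TRUE, exact = FALSE)
print(result)

# Report descriptive statistics alongside the p-value.
cat("median A:", median(ratings_a), " median B:", median(ratings_b), "\n")

A non-parametric paired test is sketched here because Likert ratings are ordinal and the same items are judged under both conditions; whether this matches the procedure the paper actually recommends should be verified against its supplementary code.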