Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science

Emily M. Bender, Batya Friedman

30 Jul 2025 (modified: 25 Sept 2018) | OpenReview Anonymous Preprint Blind Submission | Readers: Everyone

Abstract: In this paper, we propose data statements as a design solution and professional practice for natural language processing technologists, in both research and development. Through the adoption and widespread use of data statements, the field can begin to address critical scientific and ethical issues that result from the use of data from certain populations in the development of technology for other populations. We present a form that data statements can take and explore the implications of adopting them as part of regular practice. We argue that data statements will help alleviate issues related to exclusion and bias in language technology; lead to better precision in claims about how NLP research can generalize and thus better engineering results; protect companies from public embarrassment; and ultimately lead to language technology that meets its users in their own preferred linguistic style and furthermore does not misrepresent them to others. ** To appear in TACL **

TL;DR: A practical proposal for more ethical and responsive NLP technology, operationalizing transparency of test and training data

Keywords: NLP, Bias, Data Ethics, Data statements, Inclusive design, Value scenario, Value sensitive design
