Data Quality in NLP: Metrics and a Comprehensive Taxonomy

Published: 01 Jan 2024, Last Modified: 14 Jun 2024 · IDA (1) 2024 · CC BY-SA 4.0
Abstract: Data quality is a crucial factor for the success of natural language processing (NLP) models. However, the field lacks a standard taxonomy for data quality, which makes it difficult to assess and improve the quality of NLP datasets. In this work, we propose a comprehensive taxonomy for data quality in NLP, covering aspects such as linguistic quality, semantic quality, anomalies, classifier performance, and diversity. We also introduce a novel metric to measure the difficulty of a dataset, which reflects the inherent challenges of the data. We evaluate our taxonomy using a wide variety of NLP datasets that span multiple domains and tasks. The results show that our taxonomy can effectively capture changes in data quality and provide valuable insights for data creators and users. We believe that our work is a significant contribution to the field of NLP, as it provides a systematic and holistic approach to data quality assessment.