Abstract: Data quality is a crucial factor in the success of natural language processing (NLP) models. However, NLP lacks a standard taxonomy for data quality, which makes it difficult to assess and improve the quality of NLP datasets. In this work, we propose a comprehensive taxonomy for data quality in NLP, covering linguistic, semantic, anomaly-related, classifier-performance, and diversity dimensions. We also introduce a novel metric that measures the difficulty of a dataset, reflecting the inherent challenges of the data. We evaluate our taxonomy on a wide variety of NLP datasets spanning multiple domains and tasks. The results show that our taxonomy effectively captures changes in data quality and provides valuable insights for data creators and users. We believe that our work is a significant contribution to the field of NLP, as it provides a systematic and holistic approach to data quality assessment.