To Improve, or Not to Improve; How Changes in Corpora Influence the Results of Machine Learning Tasks on the Example of Datasets Used for Paraphrase Identification

Krystyna Chodorowska, Barbara Rychalska, Katarzyna Pakulska, Piotr Andruszkiewicz

2019 (modified: 12 Nov 2021), Intelligent Methods and Big Data in Industrial Applications 2019

Abstract: In this paper we attempt to verify the influence of data quality improvements on the results of machine learning tasks. We focus on measuring semantic similarity and use the SemEval 2016 datasets. To achieve consistent annotations, we made all sentences grammatically and lexically correct and developed formal semantic similarity criteria. The similarity detector used in this research was designed for the SemEval English Semantic Textual Similarity (STS) task. This paper addresses two fundamental issues: first, how each characteristic of the chosen sets affects the performance of similarity detection software, and second, which improvement techniques are most effective for the provided sets and which are not. Having analyzed these points, we present and explain the non-obvious results we obtained.
