Near duplicate column identification: a machine learning approach

Marc Chevallier, Faouzi Boufarès, Nistor Grozavu, Nicoleta Rogovschi, Charly Clairmont

Published: 2021, Last Modified: 15 May 2025, SSCI 2021, CC BY-SA 4.0

Abstract: Data quality is a global issue that every company encounters. It is a vast field of study; our work focuses on relational data. Much has been done in this field to identify duplicate rows, but here we focus on columns. We define the new concept of near duplicate columns, which characterises two columns that are very similar to each other. We introduce a method to determine whether two columns are near duplicates. We first describe a method that works for a specific column and then generalize it to any pair of columns. This method relies on classifiers trained on artificial data sets to determine whether two columns are near duplicates. In this study, we try multiple classifiers to find the most appropriate one for this learning problem. We also explore the effect of modifying experimental parameters during the generation of the artificial training data.
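
The general idea of training a classifier on artificial column pairs can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the classifiers, or the data generation procedure, so everything below is an assumption made for illustration. The integer columns, the `noise_rate` parameter, the single `match_rate` feature, and the decision stump (a one-threshold classifier, chosen here only because it needs no external library) are all hypothetical.

```python
import random

def make_column(rng, n=200):
    """Hypothetical base column: n random integers in [0, 1000]."""
    return [rng.randint(0, 1000) for _ in range(n)]

def perturb(column, rng, noise_rate=0.1):
    """Build an artificial near duplicate by overwriting a fraction of cells."""
    return [rng.randint(0, 1000) if rng.random() < noise_rate else v
            for v in column]

def match_rate(a, b):
    """Assumed feature: fraction of rows where both columns hold the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def make_training_set(rng, n_pairs=200, noise_rate=0.1):
    """Artificial training data: label 1 = near duplicate pair, 0 = unrelated pair."""
    data = []
    for _ in range(n_pairs):
        base = make_column(rng)
        data.append((match_rate(base, perturb(base, rng, noise_rate)), 1))
        data.append((match_rate(base, make_column(rng)), 0))
    return data

def fit_stump(data):
    """Decision stump: keep the feature threshold with the best training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted({f for f, _ in data}):
        acc = sum((f >= t) == bool(y) for f, y in data) / len(data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def is_near_duplicate(a, b, threshold):
    """Classify a pair of columns with the learned threshold."""
    return match_rate(a, b) >= threshold

rng = random.Random(0)  # fixed seed so the sketch is reproducible
threshold = fit_stump(make_training_set(rng))

col = make_column(rng)
print(is_near_duplicate(col, perturb(col, rng, 0.05), threshold))
print(is_near_duplicate(col, make_column(rng), threshold))
```

Changing `noise_rate` or `n_pairs` in `make_training_set` is one simple way to mimic, in this toy setting, the kind of variation in data generation parameters that the abstract mentions. A real study would use richer features and stronger classifiers than a single threshold.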