Enhancing Entity Resolution Through Graph-Based Data Augmentation and Label Noise Identification

Published: 01 Jan 2025, Last Modified: 09 Nov 2025, Venue: IEEE Access 2025, License: CC BY-SA 4.0
Abstract: Entity resolution is an important task in data integration and data cleaning. This study explores how graph-based techniques can address the challenges of data availability and label quality. An approach is formalized that leverages graph theory to enhance entity-resolution performance by augmenting datasets and identifying mislabeled data points. To evaluate three methods of the graph-based approach (data augmentation, consistency check, and loss-based outlier detection), a framework is developed that generates entity-resolution datasets with varying properties such as label noise, class distribution, and number of entities. The methods are then evaluated using this framework and ten real-world datasets. The main findings are as follows. On average, the amount of usable data can be increased by a factor of 2.5 without introducing additional label noise. Among the methods aimed at identifying label noise, the consistency check detects approximately 16% of incorrectly labeled data points. The loss-based outlier detection method removes about 26% of the false labels, although this entails discarding 10% of all data points. Further analyses investigate the factors that influence these methods and clarify how they work. The contributions of this study are: 1) formalizing previous preliminary work on graph-based augmentation, 2) developing a framework for entity-resolution dataset generation, and 3) providing an in-depth evaluation of the performance and influencing factors of graph-based augmentation and label-noise identification.
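The abstract does not spell out how the graph is built, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that records are nodes, that labeled match pairs are edges, and that matching is transitive: unlabeled pairs inside a connected component become inferred matches (augmentation), and a labeled non-match whose two records end up in the same component is flagged as a possible label error (consistency check). The function name `augment_and_check` and the record identifiers are hypothetical.

```python
from itertools import combinations


def find(parent, x):
    """Return the representative of x's cluster, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def augment_and_check(labeled_pairs):
    """labeled_pairs: iterable of (record_a, record_b, is_match).

    Returns (inferred_matches, suspicious_non_matches).
    Assumption: matches are transitive, so every pair of records inside
    one connected component of the match graph is treated as a match.
    """
    labeled_pairs = list(labeled_pairs)

    # Every record starts in its own cluster.
    parent = {}
    for a, b, _ in labeled_pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)

    # Merge clusters along labeled match edges (union-find).
    for a, b, is_match in labeled_pairs:
        if is_match:
            root_a, root_b = find(parent, a), find(parent, b)
            if root_a != root_b:
                parent[root_a] = root_b

    # Group records by cluster representative.
    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(parent, record), []).append(record)

    # Augmentation: pairs in the same cluster that were never labeled.
    known = {frozenset((a, b)) for a, b, _ in labeled_pairs}
    inferred = set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            if frozenset((a, b)) not in known:
                inferred.add((a, b))

    # Consistency check: a non-match label inside one cluster
    # contradicts the transitive closure of the match labels.
    suspicious = [
        (a, b)
        for a, b, is_match in labeled_pairs
        if not is_match and find(parent, a) == find(parent, b)
    ]
    return inferred, suspicious


pairs = [
    ("r1", "r2", True),
    ("r2", "r3", True),
    ("r3", "r4", True),
    ("r1", "r4", False),  # conflicts with the chain r1-r2-r3-r4
    ("r5", "r6", False),
]
inferred_matches, suspicious_labels = augment_and_check(pairs)
```

In this toy input, the chain of matches places r1 through r4 in one cluster, so the unlabeled pairs within it are added as matches and the non-match label on (r1, r4) is flagged. Note that a flagged contradiction only shows that some label in the cluster is wrong; which one is wrong cannot be decided from the graph alone.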