Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering

Tiezheng Nie, Wang-Chien Lee, Derong Shen, Ge Yu, Yue Kou

Published: 2014, Last Modified: 17 Apr 2025, WAIM 2014, CC BY-SA 4.0

Abstract: Entity resolution is widely used in data mining applications to find similar records. However, the increasing scale and complexity of data limit its performance. In this paper, we propose a novel entity resolution framework that clusters large-scale data using a distributed entity resolution method. We model the clustering problem as finding connected subgraphs in a similarity graph over the records. First, our approach finds pairs of records whose similarity is above a given threshold, using the appjoin algorithm, which extends the ppjoin algorithm and is executed on the MapReduce framework. Then, we propose a cache-based algorithm that clusters entities from the similar pairs; it is based on the Disjoint Set algorithm and is also designed for the MapReduce framework. Experimental results on a real dataset show that our algorithms are more efficient than previous algorithms for entity resolution and clustering.
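The two-stage idea in the abstract (a thresholded similarity join, followed by clustering the resulting pairs into connected components with a disjoint-set structure) can be illustrated with a minimal single-machine sketch. This is not the paper's appjoin or its cache-based MapReduce algorithm: the join below is a naive all-pairs comparison, and Jaccard similarity over whitespace tokens, the function names, the sample records, and the threshold 0.6 are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(records, threshold):
    """Naive all-pairs similarity join.

    A stand-in for a prefix-filtering join such as ppjoin: it returns the
    same kind of output (pairs at or above the threshold) but compares
    every pair, so it is quadratic in the number of records.
    """
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}
    return [
        (r1, r2)
        for r1, r2 in combinations(sorted(tokens), 2)
        if jaccard(tokens[r1], tokens[r2]) >= threshold
    ]


class DisjointSet:
    """Union-find over record ids, with path halving in find()."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def cluster(records, threshold):
    """Group records into connected components of the similarity graph."""
    ds = DisjointSet(records)
    for r1, r2 in similar_pairs(records, threshold):
        ds.union(r1, r2)
    groups = {}
    for rid in records:
        groups.setdefault(ds.find(rid), set()).add(rid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample records, not taken from the paper's dataset.
records = {
    "a": "apple iphone 5s 16gb",
    "b": "apple iphone 5s 16gb gold",
    "c": "iphone 5s 16gb gold unlocked",
    "d": "samsung galaxy s4",
    "e": "samsung galaxy s4 black",
}

pairs = similar_pairs(records, 0.6)
clusters = cluster(records, 0.6)
```

In this sample, records "a" and "c" have a Jaccard similarity of only 0.5, below the threshold, yet they land in the same cluster because each is similar to "b". That transitive grouping is what modeling clusters as connected subgraphs implies, and it is the part a disjoint-set structure handles cheaply.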