FrepJoin: an efficient partition-based algorithm for edit similarity join

Jizhou Luo, Shengfei Shi, Hongzhi Wang, Jianzhong Li

2017 (modified: 13 Jun 2021) | Frontiers Inf. Technol. Electron. Eng. 2017 | Readers: Everyone

Abstract: String similarity join (SSJ) is essential for many applications in which near-duplicate objects must be found. This paper targets SSJ with edit distance constraints. Existing algorithms usually adopt the filter-and-refine framework. They cannot capture the dissimilarity between subsets of strings, and they do not fully exploit statistics such as character frequencies. We develop a partition-based algorithm that uses such statistics. Frequency vectors are used to partition datasets into data chunks such that the dissimilarity between chunks is easily detected. A novel algorithm is designed to accelerate SSJ on the partitioned data. A new filter is proposed that leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs that survive the existing filters. Our algorithm notably outperforms alternative methods on real datasets.
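
To illustrate how character frequencies can rule out a candidate pair before any edit distance is computed, here is a minimal sketch of a generic frequency-vector lower bound. It is not the FrepJoin filter itself, whose details the abstract does not give. The function names (`frequency_lower_bound`, `edit_distance`, `similar`) and the threshold parameter `tau` are assumptions made for this example. The bound relies on a standard observation: a single insertion, deletion, or substitution can reduce the surplus of characters on either side by at most one, so the larger of the two surpluses cannot exceed the edit distance.

```python
from collections import Counter


def frequency_lower_bound(s, t):
    """Lower bound on edit distance derived from character frequency vectors."""
    fs, ft = Counter(s), Counter(t)
    # Counter subtraction keeps only positive differences.
    surplus_s = sum((fs - ft).values())  # characters s has in excess of t
    surplus_t = sum((ft - fs).values())  # characters t has in excess of s
    return max(surplus_s, surplus_t)


def edit_distance(s, t):
    """Classic dynamic-programming Levenshtein distance (two-row version)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def similar(s, t, tau):
    """Filter-and-refine check: cheap frequency filter first, then verification."""
    if frequency_lower_bound(s, t) > tau:
        return False  # pruned without computing the edit distance
    return edit_distance(s, t) <= tau
```

The filter costs time linear in the string lengths, whereas verification is quadratic, so every pair it prunes saves one dynamic-programming run. It can only prune: pairs such as "abc" and "cba" have identical frequency vectors and still require verification.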
