Abstract: Entity resolution is a key task in data integration and fusion. It aims to find all records that describe the same real-world entity across multiple data sources. Blocking is an important step in entity resolution because it avoids the quadratic time complexity of comparing every pair of records. Existing blocking methods that rely on token-based keys produce many redundant comparisons, while learning-based blocking methods incur significant time overhead during block generation. We therefore propose Blocking with Vector Similarity Search (B-VSS), a framework based on high-dimensional nearest neighbor search that aims to balance the effectiveness and efficiency of blocking. B-VSS consists of two key stages. First, in the record embedding stage, deep learning models generate a vector representation for each record. Second, in the block generation stage, an index is built over the dataset and then used to quickly retrieve the records most similar to each query record; these records are clustered into blocks, which significantly reduces the computational cost. In our experiments, we compare several index construction methods within the blocking framework on nine datasets. The results show that our methods preserve the quality of the generated blocks while improving blocking speed.
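To make the two stages concrete, the following is a minimal sketch under stated assumptions, not the B-VSS implementation. The function names (`embed`, `top_k`, `build_blocks`), the hashed character-trigram embedding (a toy stand-in for a deep learning encoder), the exact brute-force search (a stand-in for a nearest neighbor index), the union-find grouping, and the values of `k` and `threshold` are all hypothetical choices made for illustration.

```python
import hashlib
import math
from collections import defaultdict

DIM = 64  # embedding size; arbitrary value chosen for this sketch


def embed(record):
    """Stage 1 (toy stand-in for a deep encoder): hash character
    trigrams into a fixed-size vector and L2-normalize it."""
    vec = [0.0] * DIM
    text = "  " + record.lower() + "  "
    for i in range(len(text) - 2):
        digest = hashlib.md5(text[i:i + 3].encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def top_k(query, vectors, k, skip):
    """Stage 2a (stand-in for an index lookup): exact cosine search.
    Vectors are unit length, so the dot product is the cosine."""
    scored = [
        (sum(a * b for a, b in zip(query, vec)), j)
        for j, vec in enumerate(vectors)
        if j != skip
    ]
    scored.sort(reverse=True)
    return scored[:k]


def build_blocks(records, k=2, threshold=0.6):
    """Stage 2b: link each record to its sufficiently similar
    neighbors and return the connected groups as blocks."""
    vectors = [embed(r) for r in records]
    parent = list(range(len(records)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, query in enumerate(vectors):
        for sim, j in top_k(query, vectors, k, skip=i):
            if sim >= threshold:
                parent[find(i)] = find(j)

    blocks = defaultdict(list)
    for i in range(len(records)):
        blocks[find(i)].append(i)
    return sorted(blocks.values())


if __name__ == "__main__":
    sample = [
        "Apple iPhone 12 64GB",
        "apple iphone 12 64gb",
        "Sony WH-1000XM4 headphones",
    ]
    print(build_blocks(sample, k=2, threshold=0.9))
```

Only record pairs that fall in the same block are passed on to the matching step, which is how blocking avoids comparing every pair. In this sketch the brute-force `top_k` still scans all vectors for each query; swapping it for an approximate nearest neighbor index is where the speed gain described in the abstract would come from.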