Similarity Analysis in Data Element Matching based on Word2vec

Wenhong Liu, Zhiyuan Peng, Shuang Zhao, Jiawei Liu

Published: 2022 (QRS Companion 2022), Last Modified: 16 Feb 2025, License: CC BY-SA 4.0

Abstract: With the growing demand for computer-aided big data processing, deep learning has gradually become an effective means of supporting it. Databases maintained by different departments often contain many redundant fields. These fields are often completely equivalent but differ in name, which complicates data element matching. To this end, we propose ‘MetaMatch’, a more targeted approach to handling database fields that combines Word2vec with a high-performance database. To measure the effectiveness of the proposed approach, we propose a Word2vec-based data element matching method. The method performs semantic segmentation on key fields of the database and trains word vectors. We then tokenize each training case and construct the corresponding word vector from the word segmentation result. In our experiments, we use this method to implement data element matching for big data systems and design a validation experiment to evaluate the matching accuracy, which reached 79.3%.
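The matching steps described in the abstract (segment a field name, look up word vectors, compare fields by similarity) can be illustrated with a minimal Python sketch. It is not the authors' MetaMatch implementation. The abstract does not specify how token vectors are combined or which similarity measure is used, so averaging and cosine similarity are assumptions here. The names `TOY_VECTORS`, `tokenize`, `field_vector`, `cosine`, and `best_match` are hypothetical, and the three-dimensional vectors are made-up stand-ins for embeddings that would in practice come from a trained Word2vec model.

```python
import math
import re

# Made-up 3-dimensional vectors standing in for trained Word2vec embeddings.
# These values are illustrative assumptions, not results from the paper.
TOY_VECTORS = {
    "user": [0.9, 0.1, 0.0],
    "customer": [0.8, 0.2, 0.1],
    "name": [0.1, 0.9, 0.0],
    "order": [0.1, 0.0, 0.8],
    "id": [0.0, 0.2, 0.9],
}


def tokenize(field_name):
    """Split a field name on underscores, camelCase boundaries, and digits."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", field_name)
    return [p.lower() for p in parts]


def field_vector(field_name, vectors=TOY_VECTORS):
    """Average the vectors of all known tokens; return None if none are known."""
    known = [vectors[t] for t in tokenize(field_name) if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(field_name, candidates, vectors=TOY_VECTORS):
    """Return (candidate, score) for the most similar candidate field, or None."""
    query = field_vector(field_name, vectors)
    if query is None:
        return None
    scored = []
    for cand in candidates:
        vec = field_vector(cand, vectors)
        if vec is not None:
            scored.append((cand, cosine(query, vec)))
    return max(scored, key=lambda pair: pair[1]) if scored else None


if __name__ == "__main__":
    print(best_match("user_name", ["customerName", "order_id"]))
```

With these toy vectors, `user_name` is matched to `customerName` rather than `order_id`, because the averaged vectors of the first pair point in nearly the same direction. In a real system, a threshold on the similarity score would likely be needed to decide whether two fields count as the same data element; the abstract does not say how that decision is made.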