Abstract: Data integration combines data from different sources, making heterogeneous data convenient and quick to use in big data processing. It therefore plays an important role in many industries. Recently, a growing body of work has been devoted to data integration for relational data, with the aim of mining the underlying knowledge it contains. Embedding techniques extract the features of data and express them as low-dimensional vectors. Some existing methods treat records, attributes, and cell values in relational data as distinct objects and compute embedding representations for each, but they train the three types of objects uniformly, ignoring the differences between them. In this paper, we transform relational data into a heterogeneous graph in which the different levels of data are treated as different types of nodes. During training, a different calculation method is adopted for each node type according to its characteristics, so as to obtain more accurate embedding representations. The embeddings are then applied to specific data integration tasks. The experimental results show that the embeddings trained by the proposed model are broadly applicable across tasks and achieve satisfactory results in both schema matching and entity resolution.
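To make the graph construction concrete, the following is a minimal sketch of how one relational table might be turned into a heterogeneous graph with record, attribute, and value nodes. It is an illustration under stated assumptions, not the model described in this paper: the function name `build_hetero_graph`, the node identifiers, the lowercase normalization of cell values, and the choice to link each value node to its record and its attribute are all hypothetical.

```python
from collections import defaultdict


def build_hetero_graph(table_name, columns, rows):
    """Build typed nodes and undirected edges from one table (illustrative only)."""
    # Three node types, one per level of relational data.
    nodes = {"record": set(), "attribute": set(), "value": set()}
    # Adjacency sets keyed by (node_type, node_id).
    edges = defaultdict(set)

    def link(a, b):
        edges[a].add(b)
        edges[b].add(a)

    for col in columns:
        nodes["attribute"].add(f"{table_name}.{col}")

    for i, row in enumerate(rows):
        rec = f"{table_name}#{i}"
        nodes["record"].add(rec)
        for col, cell in zip(columns, row):
            if cell is None:
                continue  # skip missing cells
            # Assumption: values are normalized so equal cells share one node.
            val = str(cell).strip().lower()
            nodes["value"].add(val)
            link(("record", rec), ("value", val))
            link(("attribute", f"{table_name}.{col}"), ("value", val))

    return nodes, dict(edges)


columns = ["title", "year"]
rows = [("Heat", 1995), ("Casino", 1995)]
nodes, edges = build_hetero_graph("movies", columns, rows)
```

In this sketch, a value shared by several records (here the year) becomes a single node connected to all of them, which is what lets a type-aware embedding model relate records and attributes through common values.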