Abstract: Multi-source entity linkage focuses on integrating knowledge from
multiple sources by linking records that represent the same real-world entity. This is critical in high-impact applications such as data cleaning and user stitching. State-of-the-art entity linkage pipelines mainly depend on supervised learning, which requires abundant training data. However, collecting well-labeled
training data becomes expensive when the data from many sources
arrives incrementally over time. Moreover, the trained models can
easily overfit to specific data sources, and thus fail to generalize
to new sources due to significant differences in data and label distributions. To address these challenges, we present AdaMEL, a
deep transfer learning framework that learns generic high-level
knowledge to perform multi-source entity linkage. AdaMEL models
the attribute importance used to match entities through an attribute-level self-attention mechanism, and leverages massive unlabeled data from new data sources through domain adaptation to make the learned importance generic and data-source agnostic. In addition, AdaMEL
is capable of incorporating an additional set of labeled data to more
accurately integrate data sources with different attribute importance. Extensive experiments show that our framework achieves
state-of-the-art results with 8.21% improvement on average over
methods based on supervised learning. Moreover, it is more stable in handling different sets of data sources while requiring less runtime.
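The attribute-level self-attention idea described above can be illustrated with a minimal sketch (assuming a PyTorch-style implementation; the module and parameter names below are hypothetical and not taken from the paper): each aligned attribute of a record pair is scored, the scores are normalized into importance weights, and the weights pool the attributes into a pair representation.

```python
# Minimal, hypothetical sketch of attribute-level self-attention for
# attribute importance; not AdaMEL's actual API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAttention(nn.Module):
    """Scores each attribute of a record pair and pools them by importance."""

    def __init__(self, attr_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Small scoring network: one scalar score per attribute.
        self.score = nn.Sequential(
            nn.Linear(attr_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, attr_embeddings: torch.Tensor):
        # attr_embeddings: (batch, num_attributes, attr_dim), e.g. similarity
        # features of each aligned attribute (name, address, phone, ...).
        logits = self.score(attr_embeddings).squeeze(-1)   # (batch, num_attributes)
        weights = F.softmax(logits, dim=-1)                # attribute importance
        pooled = (weights.unsqueeze(-1) * attr_embeddings).sum(dim=1)
        return pooled, weights                             # pooled pair representation

# Toy usage: 8 record pairs, 4 attributes each, 32-dim attribute embeddings.
pairs = torch.randn(8, 4, 32)
model = AttributeAttention(attr_dim=32)
pair_repr, attr_importance = model(pairs)
print(attr_importance.shape)  # torch.Size([8, 4])
```

In a transfer setting such as the one the abstract describes, a pooled representation of this kind would then be fed to a matching classifier, while a domain-adaptation objective on unlabeled pairs from new sources encourages the learned importance weights to remain data-source agnostic.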