Effective entity matching with transformers

Published: 01 Jan 2023, Last Modified: 29 Mar 2024. VLDB J. 2023.
Abstract: We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer language models. We cast EM as a sequence-pair classification problem and fine-tune such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa, pre-trained on large text corpora, already significantly improves matching quality and outperforms the previous state-of-the-art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples; this forces Ditto to learn "harder" and improves the model's matching capability. These optimizations further boost Ditto's performance by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task: on matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.