ANYMATCH – Efficient Zero-Shot Entity Matching with a Small Language Model

Published: 13 Dec 2024, Last Modified: 26 Feb 2025, Venue: Good-Data, License: CC BY 4.0
Student Lead Author Indication: Yes
Keywords: Data Integration, Entity Matching, Tabular Understanding
TL;DR: We propose ANYMATCH, a zero-shot entity matching method using a small GPT-2 model with data selection techniques to achieve performance on par with trillion-parameter models at significantly lower deployment cost.
Abstract: Entity matching (EM)—identifying whether two records refer to the same entity—is critical in data integration. Many EM methods rely heavily on labelled examples, limiting their applicability in real-world settings. We address the challenging task of zero-shot entity matching, where no labelled examples are available for an unseen target dataset. Our approach, ANYMATCH, leverages a fine-tuned GPT-2 model, enhanced with novel data selection and augmentation techniques within a transfer learning framework. This design enables ANYMATCH to achieve predictive performance competitive with much larger language models while providing substantial efficiency gains. Extensive evaluations across nine benchmark datasets and comparisons with thirteen baselines show that ANYMATCH attains the second-highest overall F1 score, outperforming multiple models with hundreds of billions of parameters. Additionally, ANYMATCH offers significant cost advantages: its average prediction quality is within 4.4% of the proprietary trillion-parameter MatchGPT model, yet it requires four orders of magnitude fewer parameters and achieves a 3,899-fold reduction in inference cost (in dollars per 1,000 tokens).
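The abstract frames the method as pairwise classification: serialize a pair of records into text and let a small fine-tuned language model decide whether they refer to the same entity. The sketch below illustrates that framing with Hugging Face transformers. The base `gpt2` checkpoint, the "attribute: value" serialization, and the label convention are illustrative assumptions; ANYMATCH's actual fine-tuned weights, prompt format, and data selection pipeline are not reproduced here.

```python
# Minimal sketch of zero-shot entity matching as pairwise classification,
# in the spirit of ANYMATCH. All names below are illustrative assumptions:
# a real system would load a checkpoint fine-tuned on EM transfer data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "gpt2"  # placeholder; the classification head here is untrained

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

def serialize(record: dict) -> str:
    """Flatten a record into 'attribute: value' text (a common EM convention)."""
    return " ".join(f"{k}: {v}" for k, v in record.items())

def predict_match(record_a: dict, record_b: dict) -> bool:
    """Return True if the model classifies the pair as the same entity."""
    text = f"Record A: {serialize(record_a)} Record B: {serialize(record_b)}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return bool(logits.argmax(dim=-1).item())  # assume label 1 means "match"

# Example: two product records from different catalogues
a = {"title": "Apple iPhone 13 128GB Blue", "price": "699"}
b = {"title": "iPhone 13 (128 GB) - blue", "price": "699.00"}
print(predict_match(a, b))
```

Note that a GPT-2-sized model keeps inference cheap enough to run locally, which is the source of the cost advantage the abstract quantifies against trillion-parameter hosted models.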
Submission Number: 11