Keywords: language models, LLMs, embeddings, entity matching, entity resolution, clustering, community detection, knowledge graphs, vector databases
TL;DR: Leveraging community detection for LLM-generated match graphs to improve performance and scalability in clustering/entity matching.
Abstract: We introduce LMCD, a novel framework for semantic clustering and multi-set entity matching problems, in which we employ graph community detection algorithms to prune spurious edges from match graphs constructed using embedding and language models. We construct these match graphs by retrieving nearest embedding neighbors for each entity, then querying a language model to remove false-positive pairs. Across a variety of cluster size distributions, and for tasks ranging from sentiment and topic categorization to deduplication of product databases, our approach outperforms existing methods without requiring any finetuning or labeled data beyond few-shot examples, and without needing to select the desired number of clusters in advance. Our embedding and inference stages are fully parallelizable, with query and computational costs that scale near-linearly in the number of entities. Our post-processing stage is bottlenecked only by the runtime of community detection algorithms on discrete graphs, which are often near-linear, with no explicit dependence on embedding dimension or number of clusters. This is in stark contrast to existing methods relying on high-dimensional clustering algorithms that are difficult to apply at scale; for entity matching, our approach also ensures consistency constraints across matches regardless of group sizes, a desirable practical feature absent from all prior approaches other than vector clustering. Our improvements over previous techniques are most pronounced when clusters are numerous and heterogeneously sized, a regime that captures many clustering and matching problems of widespread practical importance.
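The abstract's pipeline (embed entities, link each to its nearest embedding neighbors, prune pairs the language model rejects, then run community detection on the resulting match graph) can be sketched in miniature as below. This is an illustrative sketch only, not the paper's implementation: the entity embeddings are toy 2-D vectors, `llm_is_match` is a hypothetical stand-in for the few-shot LLM query, and label propagation stands in for whichever community detection algorithm is used.

```python
# Toy sketch of an LMCD-style pipeline; all helper names are illustrative.
import math
import random
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_edges(embeddings, k=2):
    """Candidate match graph: each entity linked to its k nearest neighbors."""
    edges = set()
    for i, ei in enumerate(embeddings):
        sims = sorted(
            ((cosine(ei, ej), j) for j, ej in enumerate(embeddings) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def llm_is_match(i, j):
    """Placeholder for the few-shot LLM match query; here every
    candidate pair is accepted (an assumption for this sketch)."""
    return True

def label_propagation(n, edges, iters=10, seed=0):
    """Simple community detection: each node repeatedly adopts the
    most common label among its neighbors (ties break to smallest)."""
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    labels = list(range(n))
    rng = random.Random(seed)
    for _ in range(iters):
        order = list(range(n))
        rng.shuffle(order)
        for v in order:
            if not adj[v]:
                continue
            counts = defaultdict(int)
            for u in adj[v]:
                counts[labels[u]] += 1
            labels[v] = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]
    return labels

def lmcd_sketch(embeddings, k=2):
    # Stage 1-2: nearest-neighbor candidate edges; stage 3: LLM pruning;
    # stage 4: community detection on the pruned graph.
    edges = {e for e in knn_edges(embeddings, k) if llm_is_match(*e)}
    return label_propagation(len(embeddings), edges)

# Two well-separated toy "embedding" groups; each group becomes one community.
emb = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05),
       (0.0, 1.0), (0.1, 0.9), (0.05, 0.95)]
labels = lmcd_sketch(emb, k=2)
```

Because pruning and community detection operate on a sparse discrete graph, the expensive high-dimensional geometry is confined to the nearest-neighbor stage, which is the scalability property the abstract emphasizes.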
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1395