Large Language Models for Semantic Join: A Comprehensive Survey

Kijae Hong, Yeonsu Park

Published: 01 Jan 2025 · Last Modified: 21 Nov 2025 · IEEE Access · CC BY-SA 4.0
Abstract: Semantic join, the operation of integrating information siloed across heterogeneous data sources, is critical for modern data science, yet traditional methods have long been hampered by deep semantic ambiguity, poor scalability, and prohibitive manual effort. This survey posits that Large Language Models (LLMs) represent a promising new paradigm with the potential to overcome these long-standing hurdles. By replacing brittle rules and syntactic analysis with deep contextual understanding, LLMs leverage their core capabilities in contextual representation learning and in-context learning (ICL) to automate record linking and significantly improve its accuracy, matching records by their underlying conceptual relatedness. We provide a comprehensive and structured review of this burgeoning field, synthesizing the state of the art from foundational methodologies, such as data textualization and bi/cross-encoder architectures, to advanced techniques including Retrieval-Augmented Generation (RAG), structured prompting for complex reasoning, and optimizations for scalability. Furthermore, we survey transformative applications across enterprise data management, knowledge graph (KG) construction, and scientific research. By consolidating current knowledge, structuring the landscape of techniques, and identifying key open questions, this survey aims to catalyze future research and guide the development of the next generation of more powerful, reliable, and responsible semantic data integration solutions.
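To make the bi-encoder methodology named in the abstract concrete, the sketch below textualizes records from two tables and joins them on embedding similarity. It is a minimal illustration rather than a method from any surveyed paper: the sentence-transformers library, the all-MiniLM-L6-v2 checkpoint, the sample rows, and the 0.6 threshold are all illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

def textualize(row: dict) -> str:
    # "Data textualization": serialize a structured record into one sentence.
    return "; ".join(f"{k}: {v}" for k, v in row.items())

# Illustrative rows; in practice these come from two heterogeneous tables.
left = [{"product": "Apple iPhone 15", "storage": "128 GB"},
        {"product": "Galaxy S24", "storage": "256 GB"}]
right = [{"item": "iPhone 15 (128GB) by Apple"},
         {"item": "Samsung Galaxy S24 256GB smartphone"}]

# Bi-encoder: each side is embedded independently, so matching scales as
# nearest-neighbor search rather than one model call per candidate pair.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb_left = model.encode([textualize(r) for r in left], normalize_embeddings=True)
emb_right = model.encode([textualize(r) for r in right], normalize_embeddings=True)

sims = util.cos_sim(emb_left, emb_right)  # pairwise cosine similarity matrix
THRESHOLD = 0.6  # hypothetical cutoff; tune on validation pairs
for i, row in enumerate(left):
    j = int(sims[i].argmax())
    if float(sims[i, j]) >= THRESHOLD:
        print(row["product"], "<->", right[j]["item"],
              f"(sim={float(sims[i, j]):.2f})")
```

In the pipelines the survey covers, a bi-encoder stage like this typically serves as a cheap candidate filter, with a cross-encoder or prompted LLM reranking the surviving pairs for higher precision.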