TL;DR: self-improving compositional language model programs for schema matching across heterogeneous data sources
Abstract: Schema matching -- the task of finding matches between attributes across disparate data sources with different tables and hierarchies -- is critical for creating interoperable machine learning (ML)-ready data. Addressing this fundamental data-centric problem has wide implications, especially in domains like healthcare, finance, and e-commerce -- but it also has the potential to benefit ML models more generally, by increasing the data available for ML model training. However, schema matching is a challenging ML task due to structural/hierarchical and semantic heterogeneity between different schemas. Previous ML approaches to automate schema matching either require significant labeled data for model training, which is often unrealistic, or suffer from poor zero-shot performance. To address this, we propose Matchmaker, a compositional language model program for schema matching, comprising candidate generation, refinement, and confidence scoring. Matchmaker also self-improves in a zero-shot manner, without the need for labeled demonstrations, via a novel optimization approach that constructs synthetic in-context demonstrations to guide the language model's reasoning process. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches, highlighting its potential to accelerate data integration and interoperability of ML-ready data.
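The three-stage pipeline named in the abstract (candidate generation, refinement, confidence scoring) can be sketched as a minimal compositional LM program. This is an illustrative sketch only, not the paper's implementation: the `LLM` callable, the prompt wordings, and the 0.5 confidence threshold are all assumptions, and a real system would also use semantic retrieval over attribute embeddings in the first stage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical LLM interface: prompt in, text completion out.
# In practice this would wrap an API call; here it is just a callable.
LLM = Callable[[str], str]

@dataclass
class Match:
    source_attr: str
    target_attr: Optional[str]  # None means "no good match exists"
    confidence: float

def generate_candidates(source_attr: str, target_schema: List[str],
                        llm: LLM, k: int = 3) -> List[str]:
    """Stage 1: propose up to k candidate target attributes via the LLM."""
    prompt = (f"Source attribute: {source_attr}\n"
              f"Target schema: {', '.join(target_schema)}\n"
              f"List up to {k} plausible matches, comma-separated.")
    reply = llm(prompt)
    # Keep only candidates that actually exist in the target schema.
    return [c.strip() for c in reply.split(",") if c.strip() in target_schema][:k]

def refine(source_attr: str, candidates: List[str], llm: LLM) -> List[str]:
    """Stage 2: keep only candidates that survive closer LLM scrutiny."""
    kept = []
    for cand in candidates:
        verdict = llm(f"Does target attribute '{cand}' plausibly match "
                      f"source attribute '{source_attr}'? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            kept.append(cand)
    return kept

def score(source_attr: str, candidates: List[str], llm: LLM,
          threshold: float = 0.5) -> Match:
    """Stage 3: score candidates; below threshold means "no match"."""
    best, best_conf = None, 0.0
    for cand in candidates:
        reply = llm(f"Rate 0-1 how confident you are that '{cand}' "
                    f"matches '{source_attr}'.")
        try:
            conf = float(reply.strip())
        except ValueError:
            conf = 0.0
        if conf > best_conf:
            best, best_conf = cand, conf
    if best is None or best_conf < threshold:
        return Match(source_attr, None, best_conf)
    return Match(source_attr, best, best_conf)

def matchmaker(source_attr: str, target_schema: List[str], llm: LLM) -> Match:
    """Compose the three stages into one program."""
    cands = generate_candidates(source_attr, target_schema, llm)
    cands = refine(source_attr, cands, llm)
    return score(source_attr, cands, llm)

# Toy deterministic "LLM" so the sketch runs end to end.
def toy_llm(prompt: str) -> str:
    if "comma-separated" in prompt:
        return "patient_dob, admission_date"
    if "yes or no" in prompt:
        return "yes" if "patient_dob" in prompt else "no"
    return "0.9" if "patient_dob" in prompt else "0.1"

result = matchmaker("date_of_birth", ["patient_dob", "admission_date"], toy_llm)
```

The key design point the sketch reflects is that the final stage can abstain: if no candidate clears the confidence threshold, the program returns "no match" rather than forcing a pairing.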
Lay Summary: Machine learning models need large, unified datasets to work well, but in reality, data often comes from many different sources with incompatible formats. This mismatch creates a fundamental data interoperability challenge: researchers must either manually connect these data pieces (a task that took experts 500 hours for a single medical database) or use smaller, incomplete datasets that limit their models' potential. We propose Matchmaker, a self-improving LLM system that automatically finds these connections between different data formats. Unlike previous approaches that simply compare names for similarity, Matchmaker uses a multi-step reasoning process: it generates potential matches using both semantic similarity and logical reasoning, refines these candidates, and assigns confidence scores to each match. Crucially, it can also recognize when no good match exists. The system even improves itself by learning from its own successful examples, without needing human-labeled training data. When tested on complex medical databases, Matchmaker outperformed existing methods by 20%, making it significantly faster and more accurate to combine data from different sources. This could accelerate medical research, business analytics, and any field where combining diverse datasets is essential for building better AI systems.
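The self-improvement idea described above (learning from the system's own successful examples, with no human labels) can be sketched as a demonstration-selection loop. This is a hedged illustration of the general mechanism, not the paper's optimizer: the function names, the `match_fn` interface, and the 0.8 confidence cutoff are assumptions introduced here for clarity.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: given a source attribute, return the predicted
# target attribute (or None for "no match") and a confidence in [0, 1].
MatchFn = Callable[[str], Tuple[Optional[str], float]]

def build_demonstrations(source_attrs: List[str], match_fn: MatchFn,
                         min_conf: float = 0.8) -> List[Tuple[str, str]]:
    """Run the matcher on unlabeled attributes and keep only its
    high-confidence predictions as synthetic in-context demonstrations."""
    demos = []
    for attr in source_attrs:
        target, conf = match_fn(attr)
        if target is not None and conf >= min_conf:
            demos.append((attr, target))
    return demos

def format_demos(demos: List[Tuple[str, str]]) -> str:
    """Render selected demonstrations as prompt text for future queries."""
    return "\n".join(f"{src} -> {tgt}" for src, tgt in demos)

# Toy matcher standing in for a full LLM pipeline.
def toy_match_fn(attr: str) -> Tuple[Optional[str], float]:
    table = {"dob": ("patient_dob", 0.9), "free_text_notes": (None, 0.3)}
    return table.get(attr, (None, 0.0))

demos = build_demonstrations(["dob", "free_text_notes"], toy_match_fn)
prompt_block = format_demos(demos)
```

Filtering by confidence is what lets the loop bootstrap without labels: only predictions the system itself trusts get recycled as guidance for later queries.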
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Health / Medicine
Keywords: schema matching, healthcare, Large Language Models, data-centric AI
Submission Number: 10278