Exploiting Distant Supervision to Learn Semantic Descriptions of Tables with Overlapping Data

Published: 01 Jan 2024, Last Modified: 07 Oct 2025, Venue: ISWC (2) 2024, License: CC BY-SA 4.0
Abstract: Understanding the semantic structure of tabular data is essential for data integration and discovery. Specifically, the goal is to annotate columns in a tabular source with types and relationships between them using classes and predicates of a target ontology. Previous work that exploits the matches between entities in a knowledge graph and the table data does not perform well for tables with noisy or ambiguous data. A key reason for this poor performance is the limited amount of labeled data to train these methods. To address this problem, we propose a novel distant supervision approach that leverages existing Wikipedia tables and hyperlinks to automatically label tables with their semantic descriptions. Then, we use the labeled dataset to train neural network models to predict the semantic description of a new table. Our empirical evaluation shows that using the automatically labeled dataset provides approximately 5% improvement in column type prediction and 4.5% improvement in column relationship prediction in F1 scores over the state-of-the-art on a large set of real-world tables.
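As a rough illustration of the distant-supervision idea, the sketch below labels each column of a table by majority vote over the ontology classes of the entities its cells hyperlink to. It is a minimal toy under stated assumptions, not the paper's actual labeling procedure: the entity IDs, the `ENTITY_TYPES` lookup, the `min_support` threshold, and the function names are all hypothetical, and column relationship labeling is not covered.

```python
from collections import Counter

# Hypothetical toy knowledge graph: entity id -> set of ontology classes.
ENTITY_TYPES = {
    "Q1": {"City"},
    "Q2": {"City"},
    "Q3": {"Country"},
    "Q4": {"Country"},
}


def label_column(linked_entities, entity_types, min_support=0.5):
    """Pick the most frequent class among the entities linked from a column.

    linked_entities holds one entry per cell: an entity id if the cell has a
    hyperlink, or None otherwise. Returns None when no class is supported by
    at least min_support of the cells (an assumed, illustrative threshold).
    """
    votes = Counter()
    for entity in linked_entities:
        if entity is None:
            continue  # cell has no hyperlink, so it casts no vote
        for cls in entity_types.get(entity, ()):
            votes[cls] += 1
    if not votes:
        return None
    cls, count = votes.most_common(1)[0]
    return cls if count / len(linked_entities) >= min_support else None


def label_table(columns, entity_types, min_support=0.5):
    """Label every column of a table given as {column name: linked entities}."""
    return {
        name: label_column(cells, entity_types, min_support)
        for name, cells in columns.items()
    }


# Toy table: each column lists the entity linked from each cell (None = no link).
TABLE = {
    "city": ["Q1", "Q2", None],
    "country": ["Q3", "Q4", "Q3"],
    "notes": [None, None, None],
}

labels = label_table(TABLE, ENTITY_TYPES)
print(labels)
```

Columns labeled this way could then serve as training examples for a column type classifier, while columns without enough linked cells stay unlabeled.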