Mitigating Data Sparsity in Integrated Data through Text Conceptualization

Md Ataur Rahman, Sergi Nadal, Oscar Romero, Dimitris Sacharidis

Published: 2024, Last Modified: 02 Oct 2024. Venue: ICDE 2024. License: CC BY-SA 4.0

Abstract: We study the data sparsity problem for data generated by an integration system. We approach the problem from a textual information extraction perspective and propose to conceptualize external documents using the concepts in the integrated schema. We present THOR, a novel system that, unlike related approaches, relies neither on complex rules nor on models trained on large annotated corpora, but only on the integrated data and its schema, with no need for human annotations. An extensive evaluation on the text conceptualization task demonstrates that our approach outperforms state-of-the-art language models in terms of F1-score, effort, and resource use.