Abstract: Data lakes are repositories that store data sets with potential for analysis. Data lakes aim to liberate data from silos, thereby enabling cross-cutting analyses that were hitherto out of reach. However, the cost of adding data to a data lake must be kept to a minimum, and thus data sets tend to be stored in their original form, typically with limited metadata. This gives rise to significant challenges for data scientists in simply discovering which data sets may be relevant to the task at hand. Given a data set of interest, several indexing schemes have been proposed that can identify related data sets. However, such schemes tend to build on similarity metrics that may not correlate well with suitability, and they stop short of explaining how an identified data set relates to a given target. We address this problem by applying Natural Language Inference (NLI) to the data discovery problem, with a view to explaining how the attributes of retrieved data sets relate to those of the target, in terms of a collection of semantic relations. We provide two approaches to inferring semantic relations: (a) unsupervised intensional and extensional analysis of the data sources using Natural Language Processing techniques; and (b) supervised learning of semantic relations by applying BERT to source schema information. The contributions of this paper are: an NLI strategy that provides an explicit characterisation of the semantic relations between data sets in a data lake; two approaches to inferring the semantic relations between sources; and an empirical evaluation of the approaches using open government data.
First Author Is Student: Yes
Subtrack: Matching, Integration, and Fusion
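As a hedged illustration of the BERT-based approach mentioned in the abstract, semantic-relation inference between attributes can be framed as NLI-style sequence-pair classification: each attribute pair is serialized into a (premise, hypothesis) text pair that a fine-tuned model would classify. The sketch below shows only that serialization step; the function name `serialize_pair`, the relation label set, and the example table and column names are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch: framing semantic-relation inference between data set
# attributes as NLI-style sequence-pair classification. The label set and the
# serialization scheme are illustrative assumptions, not the paper's design.

# Assumed set of semantic relations a classifier might distinguish.
RELATIONS = ["equivalent", "more-general", "more-specific", "unrelated"]

def serialize_pair(source_table, source_attr, target_table, target_attr):
    """Serialize an attribute pair into (premise, hypothesis) texts, ready
    for a sequence-pair classifier such as a fine-tuned BERT model."""
    premise = f"Column '{source_attr}' of table '{source_table}'."
    hypothesis = f"Column '{target_attr}' of table '{target_table}'."
    return premise, hypothesis

# Hypothetical attribute pair from two open-government data sets.
premise, hypothesis = serialize_pair(
    "road_accidents", "casualty_severity",
    "traffic_incidents", "severity",
)
# A fine-tuned model would consume "[CLS] premise [SEP] hypothesis [SEP]"
# and output a distribution over RELATIONS; that inference step is omitted.
print(premise)
print(hypothesis)
```

In this framing, the premise/hypothesis texts could also be enriched with sample values (extensional evidence) before classification; the sketch uses schema names only.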