Automated Selection of Multiple Datasets for Extension by Integration

2021 (modified: 07 Nov 2021) | CIKM 2021 | Readers: Everyone
Abstract: Organizations often seek to extend their data by integrating it with available datasets originating from external sources. While many tools recommend how to perform the integration for given datasets, selecting which datasets to integrate is often challenging in itself. First, the relevant candidates must be efficiently identified among irrelevant ones. Next, relevant datasets need to be evaluated with respect to issues such as low quality or poor matching to the target data and schema. Last, jointly integrating multiple datasets may have significant benefits, such as increased completeness and information gain, but may also greatly complicate the task due to dependencies in the integration process. To assist administrators in this task, we quantify the extent to which an integration of multiple datasets is valuable as an extension of an initial dataset, and formalize the computational problem of finding the most valuable subset to integrate according to this measure. We formally analyze the problem, showing that it is NP-hard; we nevertheless introduce efficient heuristic algorithms, which our experiments show to be near-optimal in practice and highly effective in finding the most valuable integration.