Efficiently Transforming Tables for JoinabilityDownload PDFOpen Website

Published: 2022, Last Modified: 17 Dec 2023ICDE 2022Readers: Everyone
Abstract: Data from different sources rarely conform to a single formatting even if they describe the same set of entities, and this raises concerns when data from multiple sources must be joined or cross-referenced. Such a formatting mismatch is unavoidable when data is gathered from various public and third-party sources. Commercial database systems are not able to perform the join when there exist differences in data representation or formatting, and manual reformatting is both time consuming and error-prone. We study the problem of efficiently joining textual data under the condition that the join columns are not formatted the same and cannot be equi-joined, but they become joinable under some transformations. The problem is challenging simply because the number of possible transformations explodes with both the length of the input and the number of rows, even if each transformation is formed using very few basic units. We show that an efficient algorithm can be developed based on the common characteristics of the joined columns and over a rich set of basic operations that can be composed to form transformations. Compared to a state-of-the-art approach, our algorithm covers every transformation that is covered in the state-of-the-art approach but is a few orders of magnitude faster, as evaluated on various real and synthetic data.
0 Replies

Loading