Abstract: As the volume of dirty data grows, data cleaning has become a crux of data analysis. The accuracy of existing integrity-constraint-based cleaning approaches is limited by insufficient rules. In this paper, we present a novel hybrid data cleaning framework built on top of Markov logic networks (MLNs), termed <inline-formula><tex-math notation="LaTeX">${\sf MLNClean}$</tex-math></inline-formula>, which is capable of learning instantiated rules to supplement the insufficient integrity constraints. <inline-formula><tex-math notation="LaTeX">${\sf MLNClean}$</tex-math></inline-formula> consists of two steps, i.e., <i>pre-processing</i> and <i>two-stage data cleaning</i>. In the pre-processing step, <inline-formula><tex-math notation="LaTeX">${\sf MLNClean}$</tex-math></inline-formula> first infers a set of probable instantiated rules according to MLNs and then builds a two-layer MLN index structure to generate multiple data versions and facilitate the cleaning process. In the two-stage data cleaning step, <inline-formula><tex-math notation="LaTeX">${\sf MLNClean}$</tex-math></inline-formula> first introduces the concept of a <i>reliability score</i> to clean errors within each data version separately, and afterward eliminates conflicting values among different data versions using a novel concept of a <i>fusion score</i>. Extensive experimental results on both real and synthetic scenarios demonstrate the effectiveness of <inline-formula><tex-math notation="LaTeX">${\sf MLNClean}$</tex-math></inline-formula> in practice.