CurrentClean: Spatio-Temporal Cleaning of Stale Data

Published: 01 Jan 2019, Last Modified: 21 Jan 2025, ICDE 2019, CC BY-SA 4.0
Abstract: Data currency is essential for up-to-date and accurate data analysis. Data is considered current if changes in real-world entities are reflected in the database; when they are not, stale data arises. Identifying and repairing stale data requires more than timestamps alone. Each entity has its own update pattern in both space and time, and these patterns can be learned and predicted from available query logs. In this paper, we present CurrentClean, a probabilistic system for identifying and cleaning stale values. We introduce a spatio-temporal probabilistic model that captures database update patterns to infer stale values, and we propose a set of inference rules that model spatio-temporal update patterns commonly seen in real data. We recommend repairs for stale values by learning from past update values over cells. Our evaluation on real data shows that CurrentClean identifies stale values effectively and achieves improved error detection and repair accuracy over state-of-the-art techniques.