Abstract: Missing values in tabular data lakes can severely impair data analysis and degrade the performance of downstream applications. We argue that a robust imputation strategy should account for three aspects of variety: the source of the imputed value, the types of tables involved, and the data type of the missing value. Existing imputation methods rely on either estimation-based approaches (using a model trained on data from the same table to estimate missing values) or search-based approaches (retrieving values from other tables). Unfortunately, neither approach effectively incorporates all three aspects of variety. To address this gap, we propose CESID, a novel framework that uses a Combination of Estimation-based and Search-based methods for missing value Imputation in Data lakes. CESID contains three core modules: (1) the Contextual Search Module, which efficiently discovers candidate values from tables by exploiting contextual information; (2) the Acquisition-guided Estimation Module, which introduces an influence function and a sampling-based exploration strategy to yield accurate estimated values; and (3) the Classifier Module, which determines the most suitable method based on table-level and column-level statistics. Extensive experiments conducted on three data lakes demonstrate that CESID effectively and efficiently addresses the missing value problem.
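The distinction between search-based and estimation-based imputation can be illustrated with a toy sketch. This is not CESID's implementation: the function names (`search_candidate`, `estimate_value`, `impute`), the exact-match key lookup, the mean/mode estimator, and the fixed "search first, then estimate" rule are all assumptions made for illustration. The abstract describes a learned Classifier Module, a contextual search, and an influence-function-based estimator, none of which are reproduced here.

```python
from statistics import mean, mode


def search_candidate(row, column, other_tables, key):
    """Search-based step (toy): find a row in another table with the same key
    value and a non-missing entry for the target column."""
    for table in other_tables:
        for other in table:
            if other.get(key) == row.get(key) and other.get(column) is not None:
                return other[column]
    return None


def estimate_value(table, column):
    """Estimation-based step (toy): use only the same table's observed values.
    Mean for numeric columns, most frequent value otherwise."""
    observed = [r[column] for r in table if r.get(column) is not None]
    if not observed:
        return None
    if all(isinstance(v, (int, float)) for v in observed):
        return mean(observed)
    return mode(observed)


def impute(table, column, other_tables, key):
    """Selection step (toy): a hard-coded rule standing in for a classifier.
    Prefer a searched value; fall back to an estimate if none is found."""
    result = []
    for row in table:
        filled = dict(row)  # copy so the input table is left unchanged
        if filled.get(column) is None:
            found = search_candidate(row, column, other_tables, key)
            if found is None:
                found = estimate_value(table, column)
            filled[column] = found
        result.append(filled)
    return result
```

Here tables are modeled as lists of dictionaries and a missing value as `None`. Calling `impute(table, "population", other_tables, "city")` fills each missing `population` from a matching `city` row elsewhere when one exists, and otherwise from the column's own observed values.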
External IDs: doi:10.1007/s00778-025-00957-1