Abstract: Correcting missing or erroneous data values is an essential task in data cleaning. Traditional pre-configuration error correction (EC) methods rely heavily on predefined rules or constraints, demanding significant domain knowledge and manual effort. While configuration-free EC approaches have been explored, they still require extensive feature engineering or labeled data for intensive model training. In this paper, we propose ZeroEC, a zero-training and interpretable EC system that leverages large language models (LLMs) to generate chains of thought (CoTs) and correction rules for EC, without the need for model training. ZeroEC consists of two modules: contextual-relevant tuple search (CTS) and training-free explainable correction (TEC). CTS constructs a contextual-relevant tuple retriever based on a weighted cosine similarity function to efficiently identify the tuples most relevant to each dirty tuple, reducing redundancy in the LLM prompts and lowering computational cost. TEC employs a clustering-based representative tuple sampling strategy that alleviates the risk of "hallucination" by exposing the LLM to diverse types of data errors. It then prompts the LLM to generate correction CoTs for the user-corrected representative tuples, and from these derives correction rules and explainable corrections, so that each correction is automatically accompanied by an explanation, all without model training. Extensive experiments on various real-world datasets demonstrate that ZeroEC achieves a 66.82% increase in accuracy and a 6.87x speedup over state-of-the-art methods. The code and datasets of this paper are available at https://github.com/YangChen32768/ZeroEC.
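The CTS retrieval step described above can be illustrated with a minimal sketch. The weight vector, the per-attribute tuple embeddings, and the helper names (`weighted_cosine`, `top_k_context`) are assumptions for illustration, not the paper's actual implementation:

```python
import math

def weighted_cosine(u, v, w):
    # Cosine similarity where each dimension i is scaled by a weight w[i],
    # letting some attributes matter more than others when matching tuples.
    dot = sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))
    nu = math.sqrt(sum(wi * ui * ui for wi, ui in zip(w, u)))
    nv = math.sqrt(sum(wi * vi * vi for wi, vi in zip(w, v)))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_context(dirty_vec, tuple_vecs, weights, k=3):
    # Rank candidate tuples by weighted cosine similarity to the dirty
    # tuple and return the indices of the k most relevant ones; only
    # these would be placed in the LLM prompt as context.
    scored = [(weighted_cosine(dirty_vec, v, weights), i)
              for i, v in enumerate(tuple_vecs)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]
```

Restricting the prompt to the top-k retrieved tuples is what keeps prompt size, and hence LLM cost, bounded regardless of table size.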
External IDs: dblp:conf/icde/WuYZMNXZY25