Conformal Data Cleaning: Statistical Guarantees for Data Quality Automation in Tables

TMLR Paper984 Authors

22 Mar 2023 (modified: 17 Sept 2024) | Rejected by TMLR | CC BY 4.0

Abstract: Machine Learning (ML) components have become ubiquitous in modern software systems. In practice, major challenges remain in both translating research innovations into real-world applications and maintaining ML components. Many of these challenges, such as achieving high predictive performance and robustness and addressing ethical concerns, are related to data quality and, in particular, to the lack of automation in data pipelines upstream of ML components. While many approaches have been developed for automating data quality monitoring and improvement, it remains an open research question to what extent data cleaning can be automated. Many of the proposed solutions are tailored to specific use cases or application scenarios, require manual heuristics, or cannot be applied to heterogeneous data sets. More importantly, most approaches do not lend themselves easily to full automation. Here, we propose a novel cleaning approach, \emph{Conformal Data Cleaning} (CDC), which combines an application-agnostic ML-based data cleaning approach with conformal prediction (CP). CP is a model-agnostic and distribution-free method for calibrating ML models that gives statistical guarantees on their performance, allowing CDC to automatically identify and fix data errors in single cells of heterogeneous tabular data. We demonstrate in extensive empirical evaluations that the proposed approach improves performance on downstream ML tasks in the majority of our experiments. At the same time, it allows full automation and integration into existing ML pipelines. We believe that CDC has the potential to improve data quality with little to no manual effort for researchers and practitioners and, thereby, to contribute to more responsible usage of ML technology. Our code is available on GitHub: \emph{redacted GitHub link}.
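
To make the idea in the abstract concrete, the following is a minimal sketch of how split conformal prediction could be used to flag and repair a single numeric cell. It is an illustration under stated assumptions, not the authors' implementation: the function names `conformal_quantile` and `clean_cell`, the absolute-residual score, the calibration values, and the rule "replace the cell with the model prediction when it falls outside the interval" are all hypothetical choices made for this example.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the finite-sample-corrected (1 - alpha) quantile of the scores.

    With n calibration scores, split conformal prediction uses the score of
    rank ceil((n + 1) * (1 - alpha)). If that rank exceeds n, no finite
    threshold gives the requested coverage, so infinity is returned.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def clean_cell(observed, predicted, q):
    """Flag a cell whose value lies outside [predicted - q, predicted + q].

    Returns (cleaned_value, was_flagged). Replacing a flagged value with the
    prediction is an assumption of this sketch.
    """
    if predicted - q <= observed <= predicted + q:
        return observed, False
    return predicted, True


# Hypothetical calibration set: true column values and the predictions of
# some model that estimates this column from the other columns of the table.
cal_true = [10.0, 12.0, 11.0, 13.0, 9.0, 10.5, 12.5, 11.5]
cal_pred = [10.5, 11.5, 11.0, 12.0, 9.5, 10.0, 12.0, 12.0]

# Nonconformity score: absolute residual on held-out calibration rows.
scores = [abs(t - p) for t, p in zip(cal_true, cal_pred)]
q = conformal_quantile(scores, alpha=0.25)

# A plausible value is kept; an implausible one is flagged and replaced.
kept = clean_cell(observed=11.2, predicted=11.0, q=q)
fixed = clean_cell(observed=110.0, predicted=11.0, q=q)
print(q, kept, fixed)
```

In this sketch, `alpha` bounds the rate at which clean cells are wrongly flagged, provided the calibration rows and the new rows are exchangeable. That is the kind of distribution-free guarantee the abstract refers to.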

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: Changes requested by 'Reviewer mXky', see answer there: https://openreview.net/forum?id=XFWEvmEyBp&noteId=83gU8GTYDv

Assigned Action Editor: ~Jessica_Schrouff1

Submission Number: 984
