Abstract: Ensuring high-quality data is essential for accurate decision-making in data-driven applications. However, large-scale data pipelines suffer from missing values, inconsistencies, and anomalies arising from diverse error sources. We propose an automated data quality assessment and repair system that integrates rule-based validation, probabilistic imputation, and deep learning-based anomaly correction. Our framework continuously monitors data streams, identifies potential quality issues, and applies intelligent repair techniques using self-supervised learning. Extensive experiments on real-world financial and healthcare datasets demonstrate significant improvements in data integrity and in the performance of downstream machine learning models.
Keywords: Data Quality, Automated Data Cleaning, Anomaly Detection, Probabilistic Imputation, Data Governance