Automated Data Preparation for Machine Learning: A Survey

Published: 19 Feb 2026, Last Modified: 19 Feb 2026. Accepted by DMLR. License: CC BY-SA 4.0
Abstract: Data preparation is essential for effective machine learning (ML), yet it typically remains a manual, time-consuming process. While automated machine learning (AutoML) has successfully addressed the modeling aspects of the ML workflow, data preparation has largely been overlooked, leading to challenges with real-world, imperfect data. Conversely, a rising paradigm in artificial intelligence (AI) and ML is data-centric AI, which shifts the focus from solely refining models to enhancing data in order to advance performance boundaries. This survey motivates the need for automated data preparation solutions, offering a fundamental understanding of the benefits of data transformations and establishing the complexity of data pipeline optimization, while highlighting the importance of data quality. We provide a comprehensive overview and categorization of existing automation approaches, both within AutoML and as standalone fully or semi-automated systems. We discuss their underlying methodologies, advantages, and limitations. Our work explores the prospects of expanding automation to cover a broader data preparation process, aiming to bridge the gap between data-centric AI and AutoML. It paves the way to a wholly automated pipeline from raw real-world data to quality model predictions, and outlines future research directions towards that goal.
Keywords: Automated data preparation, data preprocessing, data pipeline optimization, machine learning, data quality, AutoML
Changes Since Last Submission: Camera-ready version. Includes very minor changes to wording and formatting.
Assigned Action Editor: ~Peter_Mattson1
Submission Number: 122