DataMorpher: Automatic Data Transformation Using LLM-Based Zero-Shot Code Generation

Published: 2025, Last Modified: 21 Jan 2026, ICDE 2025, CC BY-SA 4.0
Abstract: Data transformation is a critical challenge in modern data management systems, particularly when handling complex operations over multiple data sources. Existing approaches rely on supervised learning, which incurs substantial data labeling and training overhead. To reduce this overhead while improving accuracy, we demonstrate DataMorpher, a novel system that leverages Large Language Models (LLMs) to generate code that transforms source datasets into a user-specified target format. To generate a high-quality, token-efficient prompt, we use data profiling to extract features from the source datasets and from historical examples of the target data. We then use a ranking algorithm to select a subset of these features, reducing noise and cost. Finally, the selected features are translated into a declarative language inspired by SQL's data definition language (DDL) and added to the prompt. We will demonstrate the workflow and effectiveness of DataMorpher using real-world data transformation workflows from Microsoft's GitHub benchmark, smart building, and medical data integration. (A 5-minute video of our demo is available at https://youtu.be/CuDm46K-_eA.)
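
The prompt-construction steps described in the abstract (profile, rank, translate to a DDL-like form, add to the prompt) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from DataMorpher itself: the function names (`profile_column`, `rank_columns`, `to_ddl`, `build_prompt`), the profile features chosen, the toy ranking heuristic (name-token overlap with the target columns, then distinctness), and the use of plain SQL `CREATE TABLE` syntax as the declarative form. The system's actual features, ranking algorithm, and language may differ.

```python
def infer_type(values):
    """Map Python values to a coarse SQL-like type (illustrative only)."""
    if values and all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "INTEGER"
    if values and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "DOUBLE"
    return "VARCHAR"


def profile_column(name, values):
    """Extract a few simple profile features for one column."""
    non_null = [v for v in values if v is not None]
    return {
        "name": name,
        "type": infer_type(non_null),
        "nullable": len(non_null) < len(values),
        "distinct": len(set(non_null)),
        "rows": len(values),
    }


def rank_columns(profiles, target_names, k):
    """Hypothetical ranking: keep the k columns most related to the target.

    Score = (name tokens shared with target column names, distinct ratio).
    This is a stand-in heuristic, not the paper's ranking algorithm.
    """
    target_tokens = {tok for n in target_names for tok in n.lower().split("_")}

    def score(p):
        overlap = len(set(p["name"].lower().split("_")) & target_tokens)
        distinct_ratio = p["distinct"] / p["rows"] if p["rows"] else 0.0
        return (overlap, distinct_ratio)

    return sorted(profiles, key=score, reverse=True)[:k]


def to_ddl(table, profiles):
    """Render the selected features as a CREATE TABLE statement."""
    cols = [
        f"  {p['name']} {p['type']}{'' if p['nullable'] else ' NOT NULL'}"
        for p in profiles
    ]
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"


def build_prompt(source_ddl, target_ddl):
    """Assemble a compact zero-shot prompt from the two schema descriptions."""
    return (
        "Write code that transforms the source table into the target table.\n\n"
        f"-- Source\n{source_ddl}\n\n-- Target\n{target_ddl}\n"
    )


# Made-up example data (not from the paper's benchmarks).
source = {
    "sensor_id": [1, 2, 3],
    "temp_c": [21.5, None, 22.0],
    "room_name": ["A", "B", "A"],
    "debug_flag": ["x", "x", "x"],
}
target_history = {"sensor_id": [1, 2], "temp_f": [70.7, 71.6]}

source_profiles = [profile_column(n, v) for n, v in source.items()]
target_profiles = [profile_column(n, v) for n, v in target_history.items()]

selected = rank_columns(source_profiles, list(target_history), k=2)
source_ddl = to_ddl("source", selected)
target_ddl = to_ddl("target", target_profiles)
prompt = build_prompt(source_ddl, target_ddl)
print(prompt)
```

In this sketch, columns unrelated to the target (here `room_name` and `debug_flag`) are dropped before the prompt is built, which is the intuition behind selecting a subset of features to reduce noise and token cost.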