Abstract: Data wrangling (DW) is a fundamental step in preparing data for downstream mining tasks. Recent studies explore large language models (LLMs) as the basis of a lightweight DW paradigm. Such studies typically prompt an LLM with a DW task together with a few examples as task demonstrations (i.e., in-context learning). A problem yet to be explored is how to select the examples to maximize task effectiveness given constraints on the size of the examples. To fill this gap, we introduce the constrained Shapley value (CSV), a tailored variant of the Shapley value with a constraint on the LLM prompt size, to guide example selection. We show that CSV has desirable properties for estimating example importance. However, using CSV directly for LLM-based DW is still computationally intractable. We therefore propose activated contribution (ACSV) as an unbiased estimator of CSV, along with sample allocation algorithms with approximation guarantees. Empirical results show that, compared with DW examples manually selected by experts, CSV improves the effectiveness of LLMs on DW tasks including schema mapping, entity matching, error detection, and missing value imputation by 5.90% on average in F1 score, demonstrating the general applicability of CSV to in-context learning example selection for DW tasks.
External IDs: dblp:conf/icde/LiangWDLLTQ25