Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples
Keywords: Large language model, Data wrangling, Foundation model, Database
TL;DR: We enhance data wrangling with foundation models using embedding-based example selection.
Abstract: Data-wrangling is a process that transforms raw data for further analysis and for use in downstream tasks. Recently, it has been shown that foundation models can be successfully used for data-wrangling tasks (Narayan et. al., 2022). An important aspect of data wrangling with LMs is to properly construct prompts for the given task. Within these prompts, a crucial component is the choice of in-context examples. In the previous study of Narayan et. al., demonstration examples are chosen manually by the authors, which may not be scalable to new datasets. In this work, we propose a simple demonstration strategy that individualizes demonstration examples for each input by selecting them from a pool based on their distance in the embedding space. Additionally, we propose a postprocessing method that exploits the embedding of labels under a closed-world assumption. Empirically, our embedding-based example retrieval and postprocessing improve foundation models' performance by up to 84\% over randomly selected examples and 49\% over manually selected examples in the demonstration. Ablation tests reveal the effect of class embeddings, and various factors in demonstration such as quantity, quality, and diversity.
Submission Number: 33