Abstract: We study example selection methods for few-shot text-to-SQL tasks on unseen databases. Annotating natural-language questions with corresponding SQL queries is expensive, but abundant unlabeled questions can be used to efficiently select which examples to annotate and then to adapt models with them. Many prior works simply sample a few instances at random for few-shot learning, but random selection is not sufficient to find representative and informative examples that supply domain-specific knowledge. We therefore explore methods for efficiently choosing examples to annotate. We identify two important factors: the diversity of the selected instances and their dissimilarity to the source training data, if any exists. A diverse training set carries more domain knowledge, while dissimilar examples help fill the domain gap between source and target. We show that our best example selection approach substantially improves few-shot text-to-SQL performance both when fine-tuning T5 and when performing in-context learning with Codex, with average execution-accuracy gains of 8.7% and 4.3%, respectively, over random selection. Our extensive analysis demonstrates the importance of the similarity metric and of the embedding method used for example representations. We also find that effective example selection reduces syntax errors in the target domains. Our results encourage future work to further explore example selection for efficient adaptation of text-to-SQL models.
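The abstract names two selection criteria (diversity and dissimilarity to the source data) but not the concrete procedure. As a rough illustration only, the sketch below implements one plausible reading in Python: embed the unlabeled target questions, discard those too similar to the source training questions (the dissimilarity factor), then pick one representative per cluster among the rest (the diversity factor). The embedding model `all-MiniLM-L6-v2`, the cosine threshold, and the k-means step are all illustrative assumptions, not the authors' actual method.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def select_examples(target_questions, source_questions, k=8,
                    model_name="all-MiniLM-L6-v2", sim_threshold=0.8):
    """Pick k diverse target questions that are dissimilar to the source data.

    All hyperparameters here are assumptions for illustration.
    Returns indices into target_questions to send for SQL annotation.
    """
    model = SentenceTransformer(model_name)
    tgt = model.encode(target_questions, normalize_embeddings=True)
    src = model.encode(source_questions, normalize_embeddings=True)

    # Dissimilarity: drop target questions whose nearest source neighbor
    # is too close (cosine similarity above the threshold). Embeddings are
    # normalized, so the dot product is cosine similarity.
    max_sim_to_source = (tgt @ src.T).max(axis=1)
    keep = np.where(max_sim_to_source < sim_threshold)[0]
    if len(keep) < k:  # fall back to the k least source-similar questions
        keep = np.argsort(max_sim_to_source)[:k]

    # Diversity: cluster the remaining questions and take, from each
    # cluster, the question closest to the centroid.
    kept = tgt[keep]
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(kept)
    selected = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(kept[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(keep[members[np.argmin(dists)]]))
    return sorted(selected)
```

The returned indices identify which target-domain questions to annotate with SQL; under the setup described in the abstract, the resulting pairs would then serve as fine-tuning data (e.g., for T5) or as in-context demonstrations (e.g., for Codex).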
Paper Type: long