GeMQuAD : Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning

Amani Namboori; Shivam Sadashiv Mangale; Andy Rosenbaum; Saleh Soltan

Published: 30 Oct 2023, Last Modified: 30 Nov 2023, SyntheticData4ML 2023 Poster

Keywords: GeMQuAD, Multilingual, Extractive QA, QA, Question Answering, Generative AI, LLM, ICL, FSL, Large Language Models, Few Shot Learning, In Context Learning, Low Cost, Low Resource, Synthetic Data Generation, SDG

TL;DR: GeMQuAD: an iterative semi-supervised learning approach that selects high-quality synthetic samples to improve downstream tasks such as Extractive QA through a multi-step fine-tuning process in a low-resource multilingual setting.

Abstract: The emergence of Large Language Models (LLMs) with capabilities like In-Context Learning (ICL) has ushered in new possibilities for data generation across various domains while minimizing the need for extensive data collection and modeling techniques. Researchers have explored ways to use this generated synthetic data to optimize smaller student models for reduced deployment costs and lower latency in downstream tasks. However, ICL-generated data often suffers from low quality, because the few examples used in ICL provide only limited task specificity. In this paper, we propose GeMQuAD, a semi-supervised learning approach that extends the WeakDAP framework and is applied to a dataset generated through ICL with just one example in the target language using the AlexaTM 20B Seq2Seq LLM. Through our approach, we iteratively identify high-quality data to enhance model performance, especially in low-resource multilingual settings for the Extractive Question Answering task. Our framework surpasses the performance of a baseline model trained on an English-only dataset by 5.05/6.50 points in F1/Exact Match (EM) for Hindi and by 3.81/3.69 points in F1/EM for Spanish on the MLQA dataset. Notably, our approach uses a pre-trained LLM without any additional fine-tuning, using only one annotated example in ICL to generate data, which keeps the development process cost-effective.
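
The iterative selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the filtering rule, the stopping criterion, or the training schedule, so everything below is an assumption: the function `iterative_select`, the callables `train`, `is_consistent`, and `evaluate`, and the toy data are all hypothetical stand-ins, not the authors' implementation or the WeakDAP code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sample layout: (context, question, answer).
Sample = Tuple[str, str, str]
Model = Dict[str, str]  # toy "student": a question -> answer lookup table


def iterative_select(
    gold: List[Sample],
    synthetic: List[Sample],
    train: Callable[[List[Sample]], Model],
    is_consistent: Callable[[Model, Sample], bool],
    evaluate: Callable[[Model], float],
    max_rounds: int = 3,
) -> Tuple[Model, List[Sample]]:
    """Assumed loop: filter synthetic data with the current student,
    retrain on the kept samples plus gold data, and stop as soon as
    the dev score no longer improves."""
    best_model = train(list(gold))
    best_score = evaluate(best_model)
    selected: List[Sample] = []
    for _ in range(max_rounds):
        kept = [s for s in synthetic if is_consistent(best_model, s)]
        candidate = train(kept + list(gold))
        score = evaluate(candidate)
        if score <= best_score:
            break  # no improvement: keep the previous best model
        best_model, best_score, selected = candidate, score, kept
    return best_model, selected


# Toy stand-ins so the sketch runs end to end (all illustrative).
gold = [("Paris is in France.", "Where is Paris?", "France")]
synthetic = [
    ("Madrid is in Spain.", "Where is Madrid?", "Spain"),  # clean sample
    ("Delhi is in India.", "Where is Delhi?", "Nepal"),    # noisy sample
]
dev = [("Where is Madrid?", "Spain"), ("Where is Paris?", "France")]


def train(samples: List[Sample]) -> Model:
    # "Training" here just memorises question -> answer pairs.
    return {question: answer for _, question, answer in samples}


def is_consistent(model: Model, sample: Sample) -> bool:
    # Proxy filter: the answer must be a span of the context. A real
    # filter would presumably consult the student model's prediction.
    context, _, answer = sample
    return answer in context


def evaluate(model: Model) -> float:
    # Exact-match accuracy on the toy dev set.
    return sum(model.get(q) == a for q, a in dev) / len(dev)


model, selected = iterative_select(gold, synthetic, train, is_consistent, evaluate)
print(selected)
```

In this toy run the noisy sample is filtered out because its answer is not a span of its context, and the loop stops once a further round brings no gain on the dev set. In a real setting the three callables would wrap an actual student QA model and an F1/EM evaluation.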

Submission Number: 67
