Research Area: Data
Keywords: Synthetic Data Generation, Target Data Generation, Seedless Generation, Self Correction
TL;DR: We propose a new method that leverages LLMs to generate high-quality, diverse synthetic data without seed examples, boosting performance on various NLP tasks.
Abstract: We present TarGEN, a multi-step prompting strategy for generating high-quality synthetic datasets using LLMs. An advantage of TarGEN is its seedless nature; it does not require specific task instances, broadening its applicability beyond task replication. This differentiates it from other data generation techniques, as it can be leveraged for novel or highly domain-specific tasks with no existing data instances.
We augment TarGEN with a self-correction module that enables LLMs to rectify inaccurately labeled instances during dataset creation, ensuring reliable labels. To assess our technique's effectiveness against existing baselines, we emulate eight tasks from the SuperGLUE benchmark to create a "synthetic" version and finetune various language models on both the synthetic and original training sets. Evaluation on the original test sets reveals that models trained on the synthetic datasets perform ~1-3 percentage points higher than those trained on the original datasets. Finally, when pre-finetuned on our "synthetic" SuperGLUE dataset, Llama2 (7B) yields impressive results on the OpenLLM leaderboard, surpassing the model trained on the Self-Instruct dataset by 2.62 percentage points. Our analysis reveals that the synthetic data generated by TarGEN not only improves model learning but also exhibits comparable or higher levels of complexity and diversity, and similar levels of bias, compared with the original data.
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1330