Abstract: We present TarGEN, a multi-step prompting strategy for generating high-quality synthetic datasets using LLMs. An advantage of TarGEN is its seedless nature; it does not require specific task instances, broadening its applicability beyond task replication. This differentiates it from other data generation techniques, as it can be leveraged for novel or highly domain-specific tasks with no existing data instances. We augment TarGEN with
a self-correction module that enables LLMs to rectify inaccurately labeled
instances during dataset creation, ensuring reliable labels. To assess our
technique’s effectiveness against existing baselines, we emulate eight tasks
from the SuperGLUE benchmark to create a "synthetic" version and finetune various language models on both synthetic and original training sets.
Evaluation on the original test set reveals that models trained on the synthetic data perform ∼1–3% higher than those trained on the original datasets.
Finally, when pre-finetuned on our "synthetic" SuperGLUE dataset, Llama2
(7B) yields impressive results on the OpenLLM leaderboard, surpassing the
model trained on the Self-Instruct dataset by 2.62%. Our analysis reveals
that the synthetic data generated by TarGEN not only improves model learning, but also exhibits comparable or greater complexity and diversity, and similar levels of bias, relative to the original data.
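To make the described pipeline concrete, the sketch below illustrates one way a seedless, multi-step generation loop with label self-correction could be wired up. It is a minimal sketch under stated assumptions, not the paper's actual method: the `llm` helper, the prompt wording, the `input ||| label` output format, and the two-step structure are all hypothetical placeholders.

```python
# Minimal sketch of seedless multi-step generation with label
# self-correction. The `llm(prompt) -> str` completion function is a
# hypothetical placeholder (e.g., a wrapper around any LLM API);
# prompts and the 'input ||| label' format are illustrative only.

def llm(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

def generate_instance(task_description: str) -> dict:
    # Step 1: derive a plausible context from the task description
    # alone -- no seed examples are required.
    context = llm(
        f"Describe one plausible input scenario for this task:\n{task_description}"
    )
    # Step 2: generate a labeled instance grounded in that context.
    raw = llm(
        f"Task: {task_description}\nScenario: {context}\n"
        "Write one labeled example as 'input ||| label'."
    )
    text, label = (part.strip() for part in raw.split("|||"))
    # Step 3: self-correction -- ask the model to verify the label and
    # rectify it if wrong before the instance enters the dataset.
    corrected = llm(
        f"Task: {task_description}\nInput: {text}\nProposed label: {label}\n"
        "If the label is wrong, answer with the correct label; "
        "otherwise repeat it."
    )
    return {"input": text, "label": corrected.strip()}

def build_dataset(task_description: str, n: int) -> list[dict]:
    return [generate_instance(task_description) for _ in range(n)]
```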