Keywords: deep generative models, semi-supervised learning, knowledge distillation, large language models
Abstract: Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often relies on the availability of task-specific unlabeled data. Knowledge distillation (KD) has enabled compressing deep networks, achieving the best results when distilling knowledge on fresh task-specific unlabeled examples. However, task-specific unlabeled data can be challenging to find, especially for NLP problems. We present a simple framework called "generate, annotate, and learn (GAL)" that uses unconditional language models to synthesize in-domain unlabeled data, helping advance SSL and KD on NLP and tabular tasks. To obtain strong task-specific generative models, we either fine-tune a large language model (LLM) on inputs from specific tasks, or prompt an LLM with a few input examples to generate more unlabeled examples. Then, we use existing classifiers to annotate the generated unlabeled examples with pseudo-labels, which serve as additional training data or as additional prompts. GAL improves prompt-based few-shot learning on several NLP tasks. It also yields a new state of the art for 6-layer transformers on the GLUE leaderboard. Finally, self-training with GAL offers large gains on four tabular tasks from the UCI repository.
One-sentence Summary: We propose a framework, called GAL, that advances self-training, knowledge distillation, and few-shot learning on NLP and tabular datasets.
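Below is a minimal sketch of the generate-annotate-learn loop described in the abstract, assuming an off-the-shelf GPT-2 generator (standing in for a task-tuned or prompted LLM) and a public sentiment classifier as the annotating teacher; the model names, prompt, and hyperparameters are illustrative and not the paper's exact setup.

# Generate, annotate, learn: a hedged sketch using HuggingFace pipelines.
from transformers import pipeline

# 1) Generate: sample synthetic in-domain inputs from a language model
#    (GAL fine-tunes or prompts an LLM on task inputs; plain GPT-2 here).
generator = pipeline("text-generation", model="gpt2")
samples = generator("The movie was", max_new_tokens=20,
                    num_return_sequences=8, do_sample=True)
synthetic_inputs = [s["generated_text"] for s in samples]

# 2) Annotate: pseudo-label the synthetic inputs with an existing classifier (teacher).
teacher = pipeline("text-classification",
                   model="distilbert-base-uncased-finetuned-sst-2-english")
pseudo_labeled = [(x, teacher(x)[0]["label"]) for x in synthetic_inputs]

# 3) Learn: the pseudo-labeled pairs would be mixed with labeled data to
#    train or distill a student model; here we simply inspect them.
for text, label in pseudo_labeled:
    print(f"{label}\t{text[:60]}")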