Ensembles and Cocktails: Robust Finetuning for Natural Language Generation

09 Oct 2021, 14:49 (modified: 02 Dec 2021, 08:38) | NeurIPS 2021 Workshop DistShift Poster | Readers: Everyone
Keywords: lightweight fine-tuning, robustness, pre-trained language model, distribution shifts, generation, NLG
TL;DR: We show that a single model can combine the out-of-distribution performance of new "lightweight" finetuning methods with the in-distribution performance of traditional full finetuning.
Abstract: When finetuning a pretrained language model for natural language generation tasks, one is currently faced with a tradeoff. Lightweight finetuning (e.g., prefix-tuning, adapters), which freezes all or most of the parameters of the pretrained model, has been shown to achieve stronger out-of-distribution (OOD) performance than full finetuning, which tunes all of the parameters. However, lightweight finetuning can underperform full finetuning in-distribution (ID). In this work, we present methods to combine the benefits of full and lightweight finetuning, achieving strong performance both ID and OOD. First, we show that an ensemble of the lightweight and full finetuning models achieves the best of both worlds: performance matching the better of full and lightweight finetuning, both ID and OOD. Second, we show that we can achieve similar improvements using a single model instead of two with our proposed cocktail finetuning, which augments full finetuning via distillation from a lightweight model. Finally, we provide some explanatory theory in a multiclass logistic regression setting with a large number of classes, describing how distillation on ID data can transfer the OOD behavior of one model to another.
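As an illustration of the two ideas named in the abstract, the sketch below shows one common way to ensemble two next-token distributions and one common way to add a distillation term to a supervised loss. It is not taken from the paper: the function names, the mixing weight `weight`, the coefficient `alpha`, and the choice of a simple probability average and a cross-entropy distillation term are all assumptions made here for illustration. The paper's actual ensembling and cocktail finetuning procedures may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probs(p_full, p_light, weight=0.5):
    """Average two next-token distributions.

    `weight` is a hypothetical mixing coefficient (assumption, not from the paper).
    """
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_full, p_light)]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy of pred_probs against a (possibly soft) target distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def distillation_augmented_loss(student_logits, teacher_logits, gold_index, alpha=0.5):
    """Supervised loss plus a distillation term for a single token position.

    Student: the fully finetuned model. Teacher: the lightweight model.
    `alpha` is a hypothetical interpolation weight (assumption, not from the paper).
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    one_hot = [1.0 if i == gold_index else 0.0 for i in range(len(student))]
    supervised = cross_entropy(one_hot, student)   # fit the gold token
    distill = cross_entropy(teacher, student)      # match the teacher's distribution
    return (1.0 - alpha) * supervised + alpha * distill
```

With `alpha=0.0` the loss reduces to ordinary supervised cross-entropy, and with `alpha=1.0` the student is trained only to match the lightweight teacher; intermediate values trade off the two signals.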