Ensembles and Cocktails: Robust Finetuning for Natural Language Generation

Published: 28 Jan 2022, Last Modified: 13 Feb 2023, ICLR 2022 Submitted, Readers: Everyone
Keywords: finetuning, pretrained language models, natural language generation, robustness, prefix-tuning
Abstract: When finetuning a pretrained language model for natural language generation tasks, one is currently faced with a tradeoff. Lightweight finetuning (e.g., prefix-tuning, adapters), which freezes all or most of the parameters of the pretrained model, has been shown to achieve stronger out-of-distribution (OOD) performance than full finetuning, which tunes all of the parameters. However, lightweight finetuning can underperform full finetuning in-distribution (ID). In this work, we present methods to combine the benefits of full and lightweight finetuning, achieving strong performance both ID and OOD. First, we show that an ensemble of the lightweight and full finetuning models achieves the best of both worlds: performance matching the better of full and lightweight finetuning, both ID and OOD. Second, we show that we can achieve similar improvements using a single model instead of two with our proposed cocktail finetuning, which augments full finetuning via distillation from a lightweight model. Finally, we provide some explanatory theory in a multiclass logistic regression setting with a large number of classes, describing how distillation on ID data can transfer the OOD behavior of one model to another.
One-sentence Summary: We show that a single model can combine the out-of-distribution performance of new "lightweight" finetuning methods with the in-distribution performance of traditional full finetuning.
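
For intuition, below is a minimal sketch of the "cocktail finetuning" idea described in the abstract: full finetuning of a student model, augmented with a distillation term toward a frozen lightweight-tuned model on in-distribution data. The tiny stand-in models, the loss weight `alpha`, and the `temperature` are illustrative assumptions, not the paper's exact objective or hyperparameters.

```python
# Hypothetical sketch of cocktail finetuning: task loss + distillation from a
# frozen lightweight-tuned teacher. Models and hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 100, 32

# Stand-ins: a fully finetuned student and a frozen lightweight-tuned teacher
# (e.g., a prefix-tuned model whose backbone parameters are not updated).
student = nn.Linear(hidden, vocab_size)
teacher = nn.Linear(hidden, vocab_size)
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
alpha, temperature = 0.5, 2.0  # assumed mixing weight and softmax temperature

def cocktail_step(features, labels):
    """One training step: full-finetuning task loss plus distillation toward the teacher."""
    student_logits = student(features)
    with torch.no_grad():
        teacher_logits = teacher(features)

    # Standard supervised objective on in-distribution data.
    task_loss = F.cross_entropy(student_logits, labels)

    # KL distillation term, intended to transfer the lightweight model's
    # (more OOD-robust) behavior to the fully finetuned student.
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = (1 - alpha) * task_loss + alpha * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random "in-distribution" data.
features = torch.randn(8, hidden)
labels = torch.randint(0, vocab_size, (8,))
print(cocktail_step(features, labels))
```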