Better Fine-Tuning by Reducing Representational Collapse

Published: 12 Jan 2021, Last Modified: 03 Apr 2024, ICLR 2021 Poster, Readers: Everyone
Keywords: finetuning, nlp, representational learning, glue
Abstract: Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampled from either a normal or a uniform distribution), thereby discouraging representation change during fine-tuning when possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representational collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.
One-sentence Summary: We present a lightweight augmentation to standard fine-tuning which outperforms previous methods across the board (i.e. SOTA on 3 summarization tasks, XNLI, RoBERTa on GLUE) while being computationally cheaper than other fine-tuning approaches.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Code: [pytorch/fairseq](https://github.com/pytorch/fairseq/tree/master/examples/rxf) + [2 community implementations (Papers with Code)](https://paperswithcode.com/paper/?openreview=OQ08SN70M1V)
Data: [CNN/Daily Mail](https://paperswithcode.com/dataset/cnn-daily-mail-1), [CoLA](https://paperswithcode.com/dataset/cola), [GLUE](https://paperswithcode.com/dataset/glue), [MRPC](https://paperswithcode.com/dataset/mrpc), [MultiNLI](https://paperswithcode.com/dataset/multinli), [QNLI](https://paperswithcode.com/dataset/qnli), [Reddit](https://paperswithcode.com/dataset/reddit), [Reddit TIFU](https://paperswithcode.com/dataset/reddit-tifu), [SST](https://paperswithcode.com/dataset/sst), [SST-2](https://paperswithcode.com/dataset/sst-2), [XNLI](https://paperswithcode.com/dataset/xnli)
Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/arxiv:2008.03156/code)
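
The abstract describes replacing adversarial perturbations with noise sampled from a normal or uniform distribution. The following is a minimal sketch of that idea, not the authors' implementation (see the linked fairseq code for that). It assumes a symmetric KL divergence as the consistency term between clean and noisy predictions, and it uses a toy linear classifier in place of a pre-trained model. The names `toy_model`, `noise_regularized_loss`, `lam`, and `scale`, as well as the default values, are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    # KL(p || q) for two discrete distributions with q > 0 everywhere.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    # Assumed consistency term: KL(p || q) + KL(q || p).
    return kl(p, q) + kl(q, p)


def sample_noise(dim, scale, kind, rng):
    # Parametric noise: zero-mean normal or symmetric uniform.
    if kind == "normal":
        return [rng.gauss(0.0, scale) for _ in range(dim)]
    if kind == "uniform":
        return [rng.uniform(-scale, scale) for _ in range(dim)]
    raise ValueError(f"unknown noise kind: {kind!r}")


def toy_model(embedding, weights):
    # Hypothetical stand-in for a pre-trained model: one linear layer
    # mapping an input embedding to class logits.
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def noise_regularized_loss(embedding, label, weights,
                           lam=1.0, scale=1e-5, kind="normal", rng=None):
    # Task loss on the clean input plus a penalty on how much the
    # predicted distribution moves when the embedding is perturbed.
    rng = rng or random.Random(0)
    clean = softmax(toy_model(embedding, weights))
    noise = sample_noise(len(embedding), scale, kind, rng)
    perturbed = [e + z for e, z in zip(embedding, noise)]
    noisy = softmax(toy_model(perturbed, weights))
    task_loss = -math.log(clean[label])  # cross-entropy on the clean input
    return task_loss + lam * symmetric_kl(clean, noisy)
```

Because the noise is sampled rather than found by an inner adversarial optimization, each training step in this sketch needs only one extra forward pass, which is consistent with the abstract's claim that the method is faster than earlier trust region approaches.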