Revisiting Few-sample BERT Fine-tuning

Tianyi Zhang; Felix Wu; Arzoo Katiyar; Kilian Q Weinberger; Yoav Artzi

Revisiting Few-sample BERT Fine-tuning

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q Weinberger, Yoav Artzi

Published: 12 Jan 2021, Last Modified: 03 Apr 2024ICLR 2021 PosterReaders: Everyone

Keywords: Fine-tuning, Optimization, BERT

Abstract: This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Supplementary Material: zip

Code: [![github](/images/github_icon.svg) asappresearch/revisit-bert-finetuning](https://github.com/asappresearch/revisit-bert-finetuning)

Data: [CoLA](https://paperswithcode.com/dataset/cola), [GLUE](https://paperswithcode.com/dataset/glue), [MRPC](https://paperswithcode.com/dataset/mrpc), [MultiNLI](https://paperswithcode.com/dataset/multinli), [QNLI](https://paperswithcode.com/dataset/qnli), [SST](https://paperswithcode.com/dataset/sst)

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 6 code implementations](https://www.catalyzex.com/paper/arxiv:2006.05987/code)

10 Replies

Loading