Are the BERT family zero-shot learners? A study on their potential and limitations

Published: 01 Jan 2023, Last Modified: 20 May 2025 · Artif. Intell. 2023 · CC BY-SA 4.0
Abstract: Since the resurgence of deep learning, language models (LMs) have never been so popular. By simply increasing model scale and data size, large LMs pre-trained with self-supervision objectives demonstrate impressive results in both task performance and generalization. At the early stage, supervised fine-tuning was indispensable for adapting pre-trained language models (PLMs) to downstream tasks. Later on, the sustained growth of model capacity and data size, as well as newly introduced pre-training techniques, enabled PLMs to perform well in the few-shot setting, especially under the recent paradigm of prompt-based learning. Having witnessed the success of PLMs on few-shot tasks, we propose to further study their potential and limitations in the zero-shot setting. We use 3 models from the popular BERT family to conduct an empirical study on 20 different datasets. We are surprised to find that some simple strategies (requiring no human effort or unsupervised data) yield very promising results on a few widely used datasets, e.g., 88.34% (±0.60) accuracy on the IMDB dataset and 84.88% (±2.83) accuracy on the Amazon dataset, compared with 74.06% (±13.04) and 75.54% (±11.77) for manually created prompts without engineering; that is, the simple strategies achieve both better and more stable performance. However, we also observe limitations of PLMs in the zero-shot setting, particularly on language understanding tasks (e.g., GLUE, SuperGLUE).