Blessing of Class Diversity in Pre-training

29 Sept 2021 (modified: 04 May 2025) · ICLR 2022 Conference Withdrawn Submission · Readers: Everyone
Keywords: representation learning, statistical learning theory
Abstract: This paper presents a new statistical analysis aiming to explain the recent success of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Inspired by our theory, we propose a new regularization technique that targets multi-class pre-training: a \emph{diversity regularizer applied only to the last linear layer} in the pre-training phase. Our empirical results show that this technique consistently boosts the performance of the pre-trained BERT model on different downstream tasks.
One-sentence Summary: This paper presents a new statistical analysis aiming to explain the recent success of pre-training techniques in NLP.
Community Implementations: [1 code implementation (CatalyzeX)](https://www.catalyzex.com/paper/blessing-of-class-diversity-in-pre-training/code)
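
The abstract says the proposed regularizer promotes diversity by keeping the least singular value of the last linear layer large, but it does not give the regularizer's exact functional form. The sketch below is a minimal PyTorch illustration assuming a `-log(sigma_min)` penalty on that weight matrix; the names `model.cls_head`, `mlm_loss`, and `lambda_div` are placeholders for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

def diversity_penalty(weight: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Penalty that grows as the least singular value of `weight` shrinks.

    `weight` is the (num_classes x hidden_dim) matrix of the last linear
    layer of the multi-class pre-training head. The -log(sigma_min) form
    is an assumption; the paper's exact regularizer may differ.
    """
    # torch.linalg.svdvals returns singular values in descending order
    # and is differentiable, so the penalty can be trained end to end.
    sigma = torch.linalg.svdvals(weight)
    sigma_min = sigma[-1]
    return -torch.log(sigma_min + eps)

# Hypothetical use inside a masked-language-model pre-training step:
#   last_linear = model.cls_head                      # final nn.Linear of the MLM head
#   loss = mlm_loss + lambda_div * diversity_penalty(last_linear.weight)
#   loss.backward()
```

Applying the penalty only to the final linear layer keeps the extra cost of the SVD small (the matrix is num_classes by hidden_dim), which matches the abstract's emphasis on regularizing just that layer.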

