Abstract: Pre-trained language models (PLMs) have established a new paradigm in the field of NLP. One of the most popular and successful ways to obtain more powerful PLMs is to continuously scale up the sizes of both the models and the pre-training corpora. These large corpora, typically assembled by merging smaller ones from multiple sources, are thus growing increasingly diverse. However, colossal merged corpora do not always enhance PLMs' performance. In this paper, we identify the disadvantage of heterogeneous corpora from multiple sources for pre-training PLMs. Towards coordinated pre-training on diverse corpora, we further propose Source Prompt (SP), which explicitly prompts the model with the source of the data at both the pre-training and fine-tuning stages. Extensive experimental results show that pre-training PLMs with SP on diverse corpora significantly improves their performance on various downstream tasks.
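A minimal sketch of the Source Prompt idea described in the abstract, assuming SP is realized by prepending a special token identifying the data source to every example before tokenization; the token names, the source-to-token mapping, and the model/tokenizer choice below are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch: prepend a source-identifying token to each example
# at pre-training and fine-tuning time (assumed realization of SP).
from transformers import AutoTokenizer

# Hypothetical mapping from corpus source to a dedicated prompt token.
SOURCE_TOKENS = {
    "wikipedia": "[SRC_WIKI]",
    "books": "[SRC_BOOKS]",
    "web_crawl": "[SRC_WEB]",
}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Register the source tokens so each is kept as a single unit; the model's
# embedding matrix would also need resizing to cover the new token ids
# (e.g., model.resize_token_embeddings(len(tokenizer))).
tokenizer.add_special_tokens(
    {"additional_special_tokens": list(SOURCE_TOKENS.values())}
)

def with_source_prompt(text: str, source: str) -> str:
    """Prefix raw text with its source token before tokenization."""
    return f"{SOURCE_TOKENS[source]} {text}"

# Usage: the same prefixing is applied to pre-training and fine-tuning data.
example = with_source_prompt("Pre-trained language models have ...", "wikipedia")
encoded = tokenizer(example, return_tensors="pt")
```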