The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining

Published: 07 Oct 2023, Last Modified: 01 Dec 2023
Venue: EMNLP 2023 Main
Submission Type: Regular Long Paper
Submission Track: Machine Learning for NLP
Submission Track 2: Interpretability, Interactivity, and Analysis of Models for NLP
Keywords: Distributional Hypothesis, MLM, Pretraining
TL;DR: We show that the distributional hypothesis cannot provide a complete explanation for the efficacy of MLM pretraining.
Abstract: We analyze the masked language modeling pretraining objective function from the perspective of the Distributional Hypothesis. We investigate whether the better sample efficiency and generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data's distributional property. Via a synthetic dataset, our analysis suggests that this distributional property indeed leads to the better sample efficiency of pretrained masked language models, but it does not fully explain their generalization capability. We also conduct an analysis on two real-world datasets and demonstrate that the distributional property does not explain the generalization ability of pretrained natural language models either. Our results illustrate our limited understanding of model pretraining and suggest future research directions.
Submission Number: 1265