Abstract: Web-scale pre-training datasets are the cornerstone of large language models' success. However, text data curated from the internet inevitably contains various types of noise, whose impact on language models needs to be understood. While existing research primarily focuses on low-quality or synthetic data, the \textit{random noise} introduced by unregulated websites or crawler decoding errors has been largely overlooked. This paper \textbf{investigates the influence of such random noise and proposes strategies to mitigate its impact on downstream tasks}. Surprisingly, we observe that the rate of performance degradation is significantly lower than the proportion of noise. We provide a theoretical justification for this phenomenon, which also elucidates the success of multilingual models and can be applied to other modalities. To address the adverse effects of noise, we introduce a novel plug-and-play Local Gradient Matching loss, which explicitly enhances the denoising capability of the downstream task head by aligning the gradients of normal and perturbed features, thereby improving local smoothness without requiring knowledge of the model's parameters. Extensive experiments on 8 language and 14 vision benchmarks validate the effectiveness of our proposed method.\footnote{Code, data, and model checkpoint weights are available at \url{https://anonymous.4open.science/r/lmn-acl-E9D3}}
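To make the abstract's description of the Local Gradient Matching loss concrete, below is a minimal sketch of one plausible reading: the task loss is computed on clean features and on synthetically perturbed features, the gradients of these losses with respect to the features are aligned, and the matching term is added to the task loss so that the downstream head becomes locally smooth around clean inputs. The function name `local_gradient_matching_loss`, the Gaussian perturbation with `noise_std`, and the cosine-based alignment objective are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of a local gradient matching style auxiliary loss.
# Assumes `head` is the downstream task head (an nn.Module) applied on top of
# frozen backbone features `feats`; only the head is trained, so no knowledge
# of the backbone's parameters is required.
import torch
import torch.nn.functional as F


def local_gradient_matching_loss(head, feats, labels, noise_std=0.1):
    """Task loss plus a penalty aligning the gradients of the task loss
    w.r.t. clean and noise-perturbed features (assumed formulation)."""
    feats = feats.detach().requires_grad_(True)
    perturbed = (feats + noise_std * torch.randn_like(feats)).detach().requires_grad_(True)

    clean_loss = F.cross_entropy(head(feats), labels)
    noisy_loss = F.cross_entropy(head(perturbed), labels)

    # Gradients w.r.t. the features themselves; create_graph=True lets the
    # matching term backpropagate into the head's parameters.
    g_clean, = torch.autograd.grad(clean_loss, feats, create_graph=True)
    g_noisy, = torch.autograd.grad(noisy_loss, perturbed, create_graph=True)

    # One plausible alignment objective: per-example cosine distance.
    match = 1.0 - F.cosine_similarity(
        g_clean.flatten(1), g_noisy.flatten(1), dim=1
    ).mean()

    return clean_loss + match
```

Because the penalty only touches the head's forward pass and its gradients with respect to the features, it can in principle be bolted onto any task head as a plug-and-play term; the trade-off between the task loss and the matching term would typically be controlled by an extra weighting coefficient, omitted here for brevity.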
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Language Modeling
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 243