Do we really have to filter out random noise in pre-training data for language models?

ACL ARR 2024 December Submission243 Authors

12 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Web-scale pre-training datasets are the cornerstone of large language models' success. However, text data curated from the internet inevitably contains various types of noise, whose impact on language models needs to be understood. While existing research primarily focuses on low-quality or synthetic data, the \textit{random noise} introduced by unregulated websites or crawler decoding errors has been largely overlooked. This paper \textbf{investigates the influence of such random noise and proposes strategies to mitigate its impact on downstream tasks}. Surprisingly, we observe that the rate of performance degradation is significantly lower than the proportion of noise in the data. We provide a theoretical justification for this phenomenon, which also elucidates the success of multilingual models and extends to other modalities. To address the adverse effects of noise, we introduce a novel plug-and-play Local Gradient Matching loss, which explicitly enhances the denoising capability of the downstream task head by aligning the gradients of clean and perturbed features, improving local smoothness without requiring knowledge of the model's parameters. Extensive experiments on 8 language and 14 vision benchmarks validate the effectiveness of our proposed method. \footnote{Code, data, and model checkpoint weights are available at \url{https://anonymous.4open.science/r/lmn-acl-E9D3}}
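The abstract describes the Local Gradient Matching loss only at a high level: align the gradient of the task head's output with respect to clean features against the gradient with respect to perturbed features, without touching the backbone's parameters. A minimal sketch of that idea in PyTorch might look as follows; the function name, the Gaussian perturbation, and the squared-difference penalty are all assumptions for illustration, not the authors' exact formulation:

```python
import torch


def local_gradient_matching_loss(head, feats, noise_std=0.01):
    """Hypothetical sketch of a Local Gradient Matching (LGM) style loss.

    Matches the gradient of the task head's output w.r.t. clean features
    against the gradient w.r.t. randomly perturbed features, encouraging
    local smoothness of the head around each feature vector. Only the
    head is involved; the backbone's parameters are never accessed.
    """
    # Treat features as inputs: detach from the backbone, track gradients.
    clean = feats.detach().requires_grad_(True)
    perturbed = (feats.detach()
                 + noise_std * torch.randn_like(feats)).requires_grad_(True)

    # Scalar summaries of the head's outputs so autograd.grad returns
    # one gradient per feature vector.
    out_clean = head(clean).sum()
    out_pert = head(perturbed).sum()

    # create_graph=True keeps the graph so this loss can itself be
    # backpropagated through the head's parameters during training.
    g_clean = torch.autograd.grad(out_clean, clean, create_graph=True)[0]
    g_pert = torch.autograd.grad(out_pert, perturbed, create_graph=True)[0]

    # Penalize the mismatch between the two local gradients.
    return (g_clean - g_pert).pow(2).sum(dim=-1).mean()
```

In practice such a term would be added, with a small weight, to the ordinary task loss when fine-tuning the downstream head.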
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Language Modeling
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 243
