Do we really have to filter out random noise in pre-training data for language models?

ACL ARR 2024 December Submission243 Authors

12 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Web-scale pre-training datasets are the cornerstone of large language models' success. However, text data curated from the internet inevitably contains various types of noise, whose impact on language models needs to be understood. While existing research primarily focuses on low-quality or synthetic data, the \textit{random noise} introduced by unregulated websites or crawler decoding errors has been largely overlooked. This paper \textbf{investigates the influence of such random noise and proposes strategies to mitigate its impact on downstream tasks}. Surprisingly, we observe that the rate of performance degradation is significantly lower than the proportion of noise in the data. We provide a theoretical justification for this phenomenon, which also elucidates the success of multilingual models and extends to other modalities. To address the adverse effects of noise, we introduce a novel plug-and-play Local Gradient Matching loss, which explicitly enhances the denoising capability of the downstream task head by aligning the gradients of clean and perturbed features, improving local smoothness without requiring knowledge of the model's parameters. Extensive experiments on 8 language and 14 vision benchmarks validate the effectiveness of our proposed method. \footnote{Code, data, and model checkpoint weights are available at \url{https://anonymous.4open.science/r/lmn-acl-E9D3}}
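The abstract describes the Local Gradient Matching loss only at a high level: align the gradient of the task head's output with respect to clean features against the gradient with respect to perturbed features, without touching the backbone's parameters. A minimal sketch of that idea in PyTorch might look as follows; the function name, the Gaussian perturbation, and the squared-difference penalty are all assumptions for illustration, not the authors' exact formulation:

```python
import torch


def local_gradient_matching_loss(head, feats, noise_std=0.01):
    """Hypothetical sketch of a Local Gradient Matching (LGM) style loss.

    Matches the gradient of the task head's output w.r.t. clean features
    against the gradient w.r.t. randomly perturbed features, encouraging
    local smoothness of the head around each feature vector. Only the
    head is involved; the backbone's parameters are never accessed.
    """
    # Treat features as inputs: detach from the backbone, track gradients.
    clean = feats.detach().requires_grad_(True)
    perturbed = (feats.detach()
                 + noise_std * torch.randn_like(feats)).requires_grad_(True)

    # Scalar summaries of the head's outputs so autograd.grad returns
    # one gradient per feature vector.
    out_clean = head(clean).sum()
    out_pert = head(perturbed).sum()

    # create_graph=True keeps the graph so this loss can itself be
    # backpropagated through the head's parameters during training.
    g_clean = torch.autograd.grad(out_clean, clean, create_graph=True)[0]
    g_pert = torch.autograd.grad(out_pert, perturbed, create_graph=True)[0]

    # Penalize the mismatch between the two local gradients.
    return (g_clean - g_pert).pow(2).sum(dim=-1).mean()
```

In practice such a term would be added, with a small weight, to the ordinary task loss when fine-tuning the downstream head.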
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Language Modeling
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 243
