Keywords: LLM, large language models, pretraining, data filtering, data pruning
TL;DR: Token frequency stats can replace perplexity for LLM data filtering—1000× faster, equally effective.
Abstract: As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has demonstrated strong performance, it suffers from two drawbacks: substantial time costs and the model's inherent unreliability when handling noisy or out-of-distribution samples. In this work, we propose a simple yet powerful alternative: a prior-based data filtering method that estimates token priors using corpus-level term frequency statistics, inspired by linguistic insights on word roles and lexical density. Our approach filters documents based on the mean and standard deviation of token priors, serving as a fast proxy for PPL while requiring no model inference. Despite its simplicity, the prior-based filter achieves the highest average performance across 20 downstream benchmarks, while reducing time cost by over 1000× compared to PPL-based filtering. We further demonstrate its applicability to symbolic languages such as code and math, and its dynamic adaptability to multilingual corpora without supervision.
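To make the abstract's description concrete, the sketch below shows one way a prior-based filter of this kind could be set up: estimate token priors as corpus-level relative frequencies, then keep documents whose mean and standard deviation of per-token log-priors fall within chosen bounds. This is a minimal illustration under assumed details; the function names, whitespace tokenizer, unknown-token prior, and threshold values are not taken from the paper.

```python
from collections import Counter
import math


def build_token_priors(corpus, tokenize):
    """Estimate token priors as relative frequencies over the whole corpus."""
    counts, total = Counter(), 0
    for doc in corpus:
        tokens = tokenize(doc)
        counts.update(tokens)
        total += len(tokens)
    return {tok: c / total for tok, c in counts.items()}


def doc_prior_stats(doc, priors, tokenize, unk_prior=1e-9):
    """Mean and standard deviation of per-token log-priors for one document."""
    logs = [math.log(priors.get(tok, unk_prior)) for tok in tokenize(doc)]
    if not logs:
        return None
    mean = sum(logs) / len(logs)
    std = math.sqrt(sum((x - mean) ** 2 for x in logs) / len(logs))
    return mean, std


def prior_filter(corpus, tokenize, mean_bounds, std_bounds):
    """Keep documents whose prior statistics fall inside the given bounds."""
    priors = build_token_priors(corpus, tokenize)
    kept = []
    for doc in corpus:
        stats = doc_prior_stats(doc, priors, tokenize)
        if stats is None:
            continue
        mean, std = stats
        if mean_bounds[0] <= mean <= mean_bounds[1] and std_bounds[0] <= std <= std_bounds[1]:
            kept.append(doc)
    return kept


if __name__ == "__main__":
    # Toy corpus: the gibberish document gets an extreme mean log-prior and is dropped.
    corpus = [
        "the cat sat on the mat",
        "zxqv jjkk qqpp wwrr",
        "data filtering for llm pretraining on the web",
    ]
    kept = prior_filter(corpus, str.split, mean_bounds=(-3.0, 0.0), std_bounds=(0.0, 2.0))
    print(kept)
```

Unlike PPL filtering, nothing here requires a forward pass through a model, which is the source of the claimed speedup: the cost is a single counting pass over the corpus plus per-document summary statistics.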
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19764