Curating High Quality Pretraining Data for Language Models via Compression Ratios

ICLR 2026 Conference Submission 21618 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Data Curation
Abstract: The quality of pretraining data determines the capabilities of language models, yet identifying high-quality data among billions of web documents remains computationally prohibitive. We introduce Compel, a simple and scalable data processing step that isolates high-quality text using lightweight, compression-based signals. Our key insight is that the compression ratio of text serves as a robust, model-free proxy for information density: low compression ratios typically reflect repetitive or boilerplate content, whereas high ratios may indicate noisy or unnatural text (e.g., HTML spam or phone numbers). Compel improves dataset quality by retaining only those documents whose compression ratios fall within a chosen range, determined empirically from high-quality reference datasets, without relying on additional model training or heuristic classifiers. Applied to leading open web-scale datasets (DCLM, FineWeb, and FineWeb-EDU), Compel improves benchmark performance by around 0.5–1.1%, while requiring only a fraction of the computational resources of traditional filtering methods. These results show that compression-based filtering is a practical, compute-efficient complement to prevailing quality controls for pretraining data.
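To make the filtering criterion concrete, below is a minimal sketch of compression-ratio filtering as described in the abstract. It assumes zlib as the compressor and uses placeholder bounds; the function names (`compression_ratio`, `compel_filter`) and the specific thresholds are illustrative, not taken from the paper, which determines its bounds empirically from high-quality reference datasets.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size (in bytes).

    Lower ratios indicate highly repetitive or boilerplate text;
    higher ratios indicate noisy or unnatural text.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def compel_filter(documents, lo=0.3, hi=0.7):
    """Keep documents whose compression ratio falls within [lo, hi].

    The bounds here are placeholders for illustration only; the paper
    derives its range from high-quality reference datasets.
    """
    return [doc for doc in documents if lo <= compression_ratio(doc) <= hi]


# Example usage: the repetitive document is dropped, the natural one kept.
docs = [
    "spam spam spam spam spam spam spam spam spam spam",
    "Compression ratios can act as a cheap proxy for text quality.",
]
print(compel_filter(docs))
```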
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21618